Since the data that can be processed with mmseqs is generally quite large, the TSV output is usually very large as well. I wonder whether it would be difficult to write HDF files or another more efficient file format, e.g. Parquet, to store the data.
Concretely, I'm thinking about the output of linclust. But there it would only make sense if the cluster ID were stored as a categorical or something similar.
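To make the "categorical" idea concrete, here is a minimal sketch in plain Python. It assumes the cluster output is a two-column TSV (representative ID, then member ID); the sequence names and the `encode_clusters` helper are made up for illustration and are not part of mmseqs.

```python
import io


def encode_clusters(tsv_lines):
    """Dictionary-encode the representative column of a two-column cluster TSV.

    Assumed layout per line: <representative ID> TAB <member ID>.
    """
    categories = []  # each distinct representative ID, stored only once
    code_of = {}     # representative ID -> integer code
    codes = []       # one small integer per row instead of a repeated string
    members = []     # member IDs, one per row
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        rep, member = line.split("\t")
        if rep not in code_of:
            code_of[rep] = len(categories)
            categories.append(rep)
        codes.append(code_of[rep])
        members.append(member)
    return categories, codes, members


# Hypothetical example data: two clusters with representatives seqA and seqD.
example = io.StringIO(
    "seqA\tseqA\nseqA\tseqB\nseqA\tseqC\nseqD\tseqD\nseqD\tseqE\n"
)
categories, codes, members = encode_clusters(example)
print(categories)  # ['seqA', 'seqD']
print(codes)       # [0, 0, 0, 1, 1]
```

The saving comes from storing each representative ID once and repeating only an integer per row. As far as I know, Parquet's dictionary encoding and the pandas categorical dtype follow the same idea.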
Yes, the TSV file can take up quite a lot of space. We have often thought about binary formats but decided against them for the sake of usability. The problem with HDF is that it cannot be processed in parallel.
What do you think?