soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

Feature request: mmseqs createhdf #524

Open SilasK opened 2 years ago

SilasK commented 2 years ago

The data processed with mmseqs is generally quite large, so the TSV output is usually very large as well. I wonder how difficult it would be to write HDF files, or another more efficient file format such as Parquet, to store the data.

Concretely, I'm thinking about the output of linclust. There it would only make sense if the cluster ID were stored as a categorical (or something similar), so that each representative ID is written once rather than on every row.
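
To make the categorical idea concrete, here is a minimal sketch in plain Python. It assumes the cluster TSV has two tab-separated columns, representative ID then member ID; the function name `encode_clusters` and the example rows are hypothetical, not part of mmseqs.

```python
import io

def encode_clusters(tsv_lines):
    """Dictionary-encode the representative column of a two-column cluster TSV.

    Assumes each row is: representative_id <TAB> member_id.
    Returns (representatives, codes, members): the unique representative IDs
    in first-seen order, one integer code per row, and the member IDs.
    """
    code_of = {}          # representative ID -> integer code
    representatives = []  # integer code -> representative ID
    codes = []            # one code per input row
    members = []          # one member ID per input row
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        rep, member = line.split("\t")
        if rep not in code_of:
            code_of[rep] = len(representatives)
            representatives.append(rep)
        codes.append(code_of[rep])
        members.append(member)
    return representatives, codes, members

# Hypothetical example rows, not real mmseqs output.
example = io.StringIO("seqA\tseqA\nseqA\tseqB\nseqC\tseqC\nseqA\tseqD\n")
reps, codes, members = encode_clusters(example)
print(reps)   # ['seqA', 'seqC']
print(codes)  # [0, 0, 1, 0]
```

Each representative string is stored once and every row carries only a small integer, which is what a categorical column amounts to. With pandas, the equivalent would be `astype("category")` on the first column followed by `DataFrame.to_parquet`, which needs a Parquet engine such as pyarrow installed.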

What do you think?

martin-steinegger commented 2 years ago

Yes, the TSV file can take up quite a lot of space. We have often thought about binary formats but decided against them for the sake of usability. The problem with HDF is that it cannot be processed in parallel.
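
As an illustration of what parallel processing of a line-based TSV can look like, here is a minimal sketch that splits a file into byte ranges ending on line boundaries, so each range could be handed to a separate worker. This is an assumption about the kind of parallelism meant, not a description of how mmseqs works internally; `newline_aligned_ranges` and `count_rows` are hypothetical helper names, and the rows are made up.

```python
import os
import tempfile

def newline_aligned_ranges(path, n_chunks):
    """Split a text file into up to n_chunks byte ranges that end on line boundaries."""
    size = os.path.getsize(path)
    ranges = []
    start = 0
    with open(path, "rb") as fh:
        for i in range(1, n_chunks + 1):
            if start >= size:
                break
            target = size * i // n_chunks
            if target <= start:
                continue  # an earlier range already extends past this split point
            fh.seek(target)
            fh.readline()  # advance to the end of the line containing `target`
            end = min(fh.tell(), size)
            ranges.append((start, end))
            start = end
    return ranges

def count_rows(path, start, end):
    """Count the rows in one byte range; a worker could do any per-row work here."""
    with open(path, "rb") as fh:
        fh.seek(start)
        return len(fh.read(end - start).splitlines())

# Hypothetical two-column rows written to a temporary file.
with tempfile.NamedTemporaryFile("wb", suffix=".tsv", delete=False) as tmp:
    for i in range(5):
        tmp.write(f"rep{i}\tmem{i}\n".encode())
    path = tmp.name

ranges = newline_aligned_ranges(path, 3)
total = sum(count_rows(path, s, e) for s, e in ranges)
os.remove(path)
print(ranges, total)
```

Because every range starts and ends on a line boundary, the ranges can be processed independently (for example with `multiprocessing.Pool`) and the results combined afterwards.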