soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

mmseqs_update: Problem with data file. #4

Closed narsapuramvijaykumar closed 8 years ago

narsapuramvijaykumar commented 8 years ago

When I try to run mmseqs_update on a data set, the following error appears:

    ERROR: mmseqs_update: Problem with data file. Is it empty or is another process readning it?: Invalid argument
    commons/DBReader.cpp:49 ffindex_index_parse: /anno/narsapvi/fern_clustering/run/tmp/A.index: Invalid argument

What could be a possible solution for this?

milot-mirdita commented 8 years ago

Hi and thank you for your bug report.

Would you mind switching to MMseqs2, our new and much improved release? MMseqs1 is not supported anymore.

With that said, it looks like your A.index file is corrupted. Check that it has the expected number of lines (= number of clusters) and that every line contains three values.
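
The two checks suggested above (line count and three values per line) can be sketched as a small shell function. This is only an illustration: `check_index` is a hypothetical helper name, not part of MMseqs, and it assumes the index fields are separated by tabs or spaces.

```shell
# check_index FILE [EXPECTED_LINES]
# Hypothetical helper (not an MMseqs command). Assumes whitespace-separated fields.
# Fails if the file is empty, if any line does not have exactly three fields,
# or if the line count differs from EXPECTED_LINES (when given).
check_index() {
    index_file=$1
    expected=${2:-}

    # -s is true only if the file exists and is not empty.
    if [ ! -s "$index_file" ]; then
        echo "EMPTY: $index_file is missing or has no content"
        return 1
    fi

    # Collect every line whose field count is not 3.
    bad=$(awk 'NF != 3 { printf "line %d has %d fields\n", NR, NF }' "$index_file")
    if [ -n "$bad" ]; then
        echo "$bad"
        return 1
    fi

    # Compare the number of lines with the expected number of clusters.
    lines=$(awk 'END { print NR }' "$index_file")
    if [ -n "$expected" ] && [ "$lines" -ne "$expected" ]; then
        echo "line count $lines differs from expected $expected"
        return 1
    fi

    echo "OK: $lines lines, 3 fields each"
}
```

For example, `check_index tmp/A.index` would report an empty file, which matches the situation described later in this thread.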

narsapuramvijaykumar commented 8 years ago

I am upgrading my system (cmake, g++, etc.) to make it suitable for installing MMseqs2. Once that is done, I will switch to MMseqs2.

The A.index file is totally empty. I am unable to figure out the possible solution. I have tried multiple times with all input files formatted correctly.

Thanks in advance.

martin-steinegger commented 8 years ago

You can use MMseqs2 without any requirements by using the statically compiled version.

    wget https://mmseqs.com/latest/mmseqs-static_sse41.tar.gz
    tar xvzf mmseqs-static_sse41.tar.gz
    export PATH=$(pwd)/mmseqs/bin/:$PATH

Can you send me how you called mmseqs_update? What kind of data did you use?

narsapuramvijaykumar commented 8 years ago

First, I converted the FASTA files to ffindex format and clustered old_DB:

    fastatoffindex old_DB.fasta old_DB
    fastatoffindex new_DB.fasta new_DB
    mmseq_cluster old_DB old_DB_clu tmp/ --cascaded

I then passed the old_DB_clu generated above as oldDB_clustering to the update command:

    mmseq_update old_DB new_DB old_DB_clu new_DB_clu tmp/

This last command throws the following error:

    mmseqs_update: Problem with data file. Is it empty or is another process readning it?: Invalid argument
    commons/DBReader.cpp:49 ffindex_index_parse: tmp//A.index: Invalid argument

martin-steinegger commented 8 years ago

Is it possible that old_DB and new_DB do not have any keys in common? A.index should be just empty if there is no overlap between both databases.
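
One way to test for this no-overlap case before running the update is to compare the identifiers of the two FASTA inputs. The sketch below assumes that the database keys correspond to the first whitespace-delimited token of each header line, which is an assumption and not something confirmed in this thread. `fasta_ids` and `shared_ids` are hypothetical helper names, not MMseqs commands.

```shell
# fasta_ids FASTA
# Print the identifier of every record (text after '>' up to the first
# whitespace), sorted and de-duplicated. Assumption: this token is the key.
fasta_ids() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | LC_ALL=C sort -u
}

# shared_ids OLD_FASTA NEW_FASTA
# Print the identifiers that occur in both files, one per line.
shared_ids() {
    old_ids=$(mktemp)
    new_ids=$(mktemp)
    fasta_ids "$1" > "$old_ids"
    fasta_ids "$2" > "$new_ids"
    # comm -12 keeps only lines common to both sorted inputs.
    LC_ALL=C comm -12 "$old_ids" "$new_ids"
    rm -f "$old_ids" "$new_ids"
}
```

If `shared_ids old_DB.fasta new_DB.fasta` prints nothing, the two inputs share no identifiers, which would be consistent with an empty A.index.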

narsapuramvijaykumar commented 8 years ago

There is no identical fasta ids/keys between old_DB and new_DB

martin-steinegger commented 8 years ago

Okay, we do not handle this case. We will check whether this also occurs in MMseqs2.
As a quick fix, you could add one sequence from old_DB to new_DB.
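
A minimal sketch of this quick fix at the FASTA level, assuming the databases are rebuilt from FASTA afterwards. `first_record` and `add_shared_record` are hypothetical helper names; the result is written to a separate output file so that the original new_DB.fasta stays untouched.

```shell
# first_record FASTA
# Print the first FASTA record (header line plus its sequence lines).
first_record() {
    awk '/^>/ { n++ } n == 1 { print } n > 1 { exit }' "$1"
}

# add_shared_record OLD_FASTA NEW_FASTA OUT_FASTA
# Copy NEW_FASTA to OUT_FASTA and append the first record of OLD_FASTA,
# so that the two inputs share at least one sequence.
add_shared_record() {
    # awk prints every line with a trailing newline, which guards against
    # a NEW_FASTA whose last line is not newline-terminated.
    awk '{ print }' "$2" > "$3"
    first_record "$1" >> "$3"
}
```

For example, `add_shared_record old_DB.fasta new_DB.fasta new_DB_fixed.fasta` followed by rebuilding the new database from `new_DB_fixed.fasta`.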

narsapuramvijaykumar commented 8 years ago

I have tried to update my clusters with the statically compiled version of MMseqs2. The command used for updating the clusters was:

mmseqs clusterupdate old_DB new_DB old_DB_clu new_DB_clu tmp2/

Errors noticed at various levels:

Failed to mmap memory dataSize=0 File=tmp2//NEWDB.newSeqs
Failed to mmap memory dataSize=0 File=tmp2//NEWDB.newSeqs
mv: cannot stat `/tmp2/aln_4': No such file or directory
Failed to mmap memory dataSize=0 File=tmp2//NEWDB.newSeqs
Could not open data file tmp2//newSeqsHits.swapped.all!
awk: cmd. line:1: fatal: cannot open file `tmp2//newSeqsHits.index' for reading (No such file or directory)
mv: cannot stat `tmp2//aln_*': No such file or directory
mv: cannot stat `tmp2//pref_*': No such file or directory
mv: cannot stat `tmp2//clu_*': No such file or directory
mv: cannot stat `tmp2//input_*': No such file or directory
Failed to mmap memory dataSize=0 File=tmp2//toBeClusteredSeparately
Failed to mmap memory dataSize=0 File=tmp2//toBeClusteredSeparately
awk: cmd. line:1: fatal: cannot open file `tmp2//clu_redundancy.index' for reading (No such file or directory)
Failed to mmap memory dataSize=0 File=tmp2//toBeClusteredSeparately
Could not open data file tmp2//input_step_redundancy!
Could not open data file tmp2//input_step_redundancy!
Could not open data file tmp2//input_step_redundancy!
Failed to mmap memory dataSize=0 File=tmp2//toBeClusteredSeparately
mv: cannot stat `tmp2//clu': No such file or directory
mv: cannot stat `tmp2//clu.index': No such file or directory

narsapuramvijaykumar commented 8 years ago

After adding a single sequence record from old_DB to new_DB, the mmseqs program seems to be working. Thanks for the input, martin-steinegger. It would be a great help for us if the above case (no identical keys between the old and new DB) were handled in MMseqs or MMseqs2 in the future.

martin-steinegger commented 8 years ago

We fixed the updating in MMseqs2. To use MMseqs2, you need to recluster your database because the database format changed. To recluster, recreate the sequence database with createdb and then call cluster. You can then update this clustering. Sorry for the inconvenience. Please open another issue if there is still a problem.
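
The reclustering steps described above (createdb, then cluster) could be wrapped roughly as follows. This is a sketch under assumptions: `recluster` is a hypothetical helper name, the `mmseqs` binary is assumed to be on PATH (or set through `MMSEQS`), and the argument order should be checked against the help output of the installed version. The update step itself is left out because its argument list may differ between versions.

```shell
# Assumption: the mmseqs binary is on PATH; set MMSEQS to point elsewhere.
MMSEQS=${MMSEQS:-mmseqs}

# recluster FASTA SEQ_DB CLU_DB TMP_DIR
# Hypothetical wrapper: rebuild the sequence database from FASTA with
# createdb, then cluster it. Stops at the first failing step.
recluster() {
    mkdir -p "$4" || return 1
    "$MMSEQS" createdb "$1" "$2" || return 1
    "$MMSEQS" cluster "$2" "$3" "$4"
}
```

For example, `recluster old_DB.fasta old_DB old_DB_clu tmp2` would recreate old_DB and old_DB_clu in the new format.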

altaetran commented 6 years ago

Having problems with this currently.

Normal output from mmseqs cluster is

    MMseqs Version:             8c0c7fb86ce154b236d6fe294811de3b09850fba
    Sub Matrix                  blosum62.out
    Alphabet size               13
    Seq. Id Threshold           0.8
    Kmer per sequence           20
    Mask Residues               0
    Coverage Mode               0
    K-mer size                  10
    Coverage threshold          0.8
    Max. sequence length        32000
    Shift hash                  5
    Split Memory Limit          0
    Include only extendable     false
    Threads                     20
    Verbosity                   3

    Database type: Aminoacid
    V -> I
    M -> L
    Q -> E
    T -> S
    R -> K
    Y -> F
    S -> A
    N -> D
    Reduced amino acid alphabet: A C D E F G H I K L P W X

    Needed memory (4160 byte) of total memory (243728993894 byte)
    Process file into 1 parts
    Generate k-mers list 0

    Time for fill: 0 h 0 m 0s
    Done.
    Sort kmer ... Done.
    Time for sort: 0 h 0 m 0s
    Sort by rep. sequence ... Done
    Time for sort: 0 h 0 m 0s
    Time for fill: 0 h 0 m 0s
    Time for merging files: 0 h 0 m 0 s
    Time for processing: 0 h 0 m 0s
    Rescore diagonals.

However, I get the following error when there are some identical sequences:

    MMseqs Version:             8c0c7fb86ce154b236d6fe294811de3b09850fba
    Sub Matrix                  blosum62.out
    Alphabet size               13
    Seq. Id Threshold           0.8
    Kmer per sequence           20
    Mask Residues               0
    Coverage Mode               0
    K-mer size                  10
    Coverage threshold          0.8
    Max. sequence length        32000
    Shift hash                  5
    Split Memory Limit          0
    Include only extendable     false
    Threads                     20
    Verbosity                   3

    Database type: Aminoacid
    V -> I
    M -> L
    Q -> E
    T -> S
    R -> K
    Y -> F
    S -> A
    N -> D
    Reduced amino acid alphabet: A C D E F G H I K L P W X

    Needed memory (3200 byte) of total memory (243728993894 byte)
    Process file into 1 parts
    Generate k-mers list 0

    Time for fill: 0 h 0 m 0s
    Done.
    Sort kmer ... Done.
    Time for sort: 0 h 0 m 0s
    Sort by rep. sequence ... Done
    Time for sort: 0 h 0 m 0s
    ~/tmp/2230240821590219627/linclust/10911929691229894817/linclust.sh: line 18: 28296 Segmentation fault (core dumped) $MMSEQS kmermatcher "$INPUT" "$3/pref" ${KMERMATCHER_PAR}
    Rescore diagonals.

Has anyone seen this before? Thanks!

martin-steinegger commented 6 years ago

@altaetran does the problem also occur with the newest version of MMseqs2? Is the issue related to updating?

altaetran commented 6 years ago

I'm not sure. I installed it a few months ago at most, though, and I don't have the flexibility to install a new mmseqs version at the moment.

altaetran commented 6 years ago

I manually removed the redundancies before entering into mmseqs and it worked again. I suspect there is something off about the redundancy filter in mmseqs.
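
The manual workaround described above (removing identical sequences before running mmseqs) could look roughly like the following. This is only a sketch of one way to do it, not the redundancy filter used inside mmseqs, and whether duplicates were the actual cause of the segmentation fault is not confirmed in this thread. `dedup_fasta` is a hypothetical helper name.

```shell
# dedup_fasta FASTA
# Print the records of FASTA, keeping only the first record for each
# distinct sequence. Sequence lines are joined, so each kept record is
# written as a header line followed by one sequence line.
dedup_fasta() {
    awk '
        # Write the pending record unless its sequence was already seen.
        function emit() {
            if (header != "" && !(seq in seen)) {
                seen[seq] = 1
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }
        { seq = seq $0 }
        END { emit() }
    ' "$1"
}
```

For example, `dedup_fasta input.fasta > input_nr.fasta` writes a copy without exact duplicate sequences. Matching is exact and case-sensitive.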