linxingchen / cobra

A tool to raise the quality of viral genomes assembled from short-read metagenomes via resolving and joining of contigs fragmented during de novo assembly.
MIT License

Missing `COBRA_retrieved_for_joining` contig file #30

Open Vini2 opened 5 months ago

Vini2 commented 5 months ago

Hello Cobra authors!

Thanks for building this tool.

I keep getting the following error about a contig file that is not being created.

Traceback (most recent call last):
  File "/home/mall0133/miniconda3/envs/cobra/bin/cobra-meta", line 10, in <module>
    sys.exit(main())
  File "/home/mall0133/miniconda3/envs/cobra/lib/python3.8/site-packages/cobra.py", line 1747, in main
    '\t'.join([contig, str(header2len[contig]), summarize(contig), query2current[contig]]) + '\n')
  File "/home/mall0133/miniconda3/envs/cobra/lib/python3.8/site-packages/cobra.py", line 575, in summarize
    b = count_seq('COBRA_retrieved_for_joining/{0}_retrieved.fa'.format(item))  # number of retrieved contigs
  File "/home/mall0133/miniconda3/envs/cobra/lib/python3.8/site-packages/cobra.py", line 482, in count_seq
    a = open(fasta_file, 'r')
FileNotFoundError: [Errno 2] No such file or directory: 'COBRA_retrieved_for_joining/NODE_1990_length_17225_cov_13.396421_retrieved.fa'

Every time I try re-running, it fails at this stage with a FileNotFoundError for a different contig.

This is my Cobra command. I'm using the latest version (version 1.2.3).

cobra-meta -f contigs.fasta -q query_contigs.fasta -c coverage.tsv -m sorted_reads.bam -a metaspades -mink 21 -maxk 127

My query_contigs.fasta file contains about 2300 contig sequences. I've also attached the log file of the run. log.txt

Any advice on how to fix this error and get Cobra running will be appreciated.

Thanks!

linxingchen commented 5 months ago

Hi,

Sorry to hear that you are having problems running COBRA.

We have noticed this issue and have been working on it.

Could you please let me know: (1) Did you re-run on the same sample? (2) Does the FileNotFoundError name a different contig every time? (3) Have you seen this error with other samples?

Thank you.

Best, LINXING

Vini2 commented 5 months ago

Hi @linxingchen,

Thanks for the quick reply!

(1) I removed the Cobra output folder and re-ran on the same sample.
(2) Yes, I've run Cobra 3 times and each time it reports a different contig: NODE_21535_length_4721_cov_9.622333, NODE_1990_length_17225_cov_13.396421 and NODE_13930_length_6078_cov_11.714670.
(3) I haven't tested on any other sample yet.

Thanks!

linxingchen commented 5 months ago

Hi @Vini2, thanks for the information.

If possible, please share the "potential_join_path" file with me so I can check whether these three contigs are in the same path. We have been working on this, but please give us more time to fix it. Thank you.

Vini2 commented 5 months ago

Hi @linxingchen,

Here is the file you requested. COBRA_potential_joining_paths.txt

Let me know if you need further details.

Thanks for taking a look at this issue.

linxingchen commented 5 months ago

Thank you.

These three contigs are not in the same path, so I have no idea why the file does not exist. Could you run another sample and see whether it happens again?

Vini2 commented 5 months ago

Hi @linxingchen,

I did a bit of debugging and I think some edge cases cause the errors.

This time the file that could not be found was NODE_30571_length_3853_cov_16.548864_retrieved.fa.

The call to summarize(contig) on line 1747 fails at line 575, which is in the inner else block of a nested else. There you have

item = is_subset_of(contig)

Then it tries to count the sequences in the retrieved file for item.

Here, contig is NODE_111132_length_1779_cov_9.005448, which is a subset of NODE_30571_length_3853_cov_16.548864 (or the other way around). item is therefore NODE_30571_length_3853_cov_16.548864, which is an extended partial query, so it is never retrieved for joining and its file is never created.
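To make the failure mode above easier to inspect, here is a minimal Python sketch that lists which contigs have no retrieved file and counts records without raising when a file is absent. It is not COBRA's code: the helper names (`retrieved_path`, `count_seq_or_none`, `missing_retrieved`) and the `out_dir` argument are hypothetical, and only the `COBRA_retrieved_for_joining/{contig}_retrieved.fa` pattern is taken from the traceback.

```python
from pathlib import Path
from typing import Iterable, List, Optional


def retrieved_path(out_dir: Path, contig: str) -> Path:
    # Path pattern copied from the traceback; out_dir is an assumed
    # argument pointing at the folder COBRA was writing into.
    return out_dir / "COBRA_retrieved_for_joining" / f"{contig}_retrieved.fa"


def count_seq_or_none(fasta_file: Path) -> Optional[int]:
    """Count FASTA records, or return None if the file was never written."""
    if not fasta_file.is_file():
        return None
    with fasta_file.open() as handle:
        return sum(1 for line in handle if line.startswith(">"))


def missing_retrieved(out_dir: Path, contigs: Iterable[str]) -> List[str]:
    """Return the contigs that have no retrieved-for-joining FASTA file."""
    return [c for c in contigs if not retrieved_path(out_dir, c).is_file()]
```

Returning None instead of raising only helps with diagnosis; it would not fix the underlying question of which contig summarize should look up.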

I came across some more edge cases which I couldn't test out in detail.

How would you recommend running these contigs? I'm not sure how to check if a contig is a subset or not. Would running them one by one be better? Appreciate any suggestions.

I'll try running on another sample as well.

Thanks!

linxingchen commented 5 months ago

Hi @Vini2,

Thanks for taking a deep look at the issue. Sorry for my delayed reply.

Were you still running the same sample as before?

You will avoid the error if you run the query contigs one by one. In the end, though, I have to fix this issue properly, hopefully next week (I am tied up with a grant proposal for now).

Best, LINXING
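One way to run the queries one by one, as suggested above, is to split query_contigs.fasta into single-record files and build one cobra-meta command per file. The sketch below is only an illustration: `read_fasta`, `split_queries` and `build_command` are hypothetical helpers, the input file names and flags are copied from the command earlier in this thread, and it assumes every FASTA header is non-empty. How COBRA names the output folder for each run is not shown in this thread, so check that before launching a batch.

```python
from pathlib import Path
from typing import Dict, List, Optional


def read_fasta(path: Path) -> Dict[str, str]:
    """Read a FASTA file into {name: sequence}, keeping record order."""
    records: Dict[str, List[str]] = {}
    name: Optional[str] = None
    with path.open() as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # Use the first whitespace-delimited token as the name
                # (assumes the header is not empty).
                name = line[1:].split()[0]
                records[name] = []
            elif name is not None and line:
                records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def split_queries(query_fasta: Path, out_dir: Path) -> List[Path]:
    """Write each query contig to its own single-record FASTA file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, seq in read_fasta(query_fasta).items():
        single = out_dir / f"{name}.fasta"
        single.write_text(f">{name}\n{seq}\n")
        paths.append(single)
    return paths


def build_command(single_query: Path) -> List[str]:
    # Flags and input file names are taken from the command in this thread.
    return ["cobra-meta", "-f", "contigs.fasta", "-q", str(single_query),
            "-c", "coverage.tsv", "-m", "sorted_reads.bam",
            "-a", "metaspades", "-mink", "21", "-maxk", "127"]
```

Each list returned by `build_command` can be passed to `subprocess.run(cmd, check=True)`. With about 2300 queries this means about 2300 separate runs, so it is a workaround rather than a fix.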

linxingchen commented 4 months ago

Hi @Vini2,

Sorry for my delayed reply on this. @Hocnonsense has rewritten most of the script. It would be great if you could try it (attached) and let us know whether it resolves your issue.

Thank you.

cobra_Hocnonsense.py.zip