metagentools / GraphBin2

☯️🧬 Refined and Overlapped Binning of Metagenomic Contigs Using Assembly Graphs
https://graphbin2.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

contig naming issue #2

Closed nick-youngblut closed 3 years ago

nick-youngblut commented 4 years ago

I'm running graphbin2 with spades input and getting the following error:

Please make sure that you have provided the correct assembler type and the correct path to the binning result file in the correct format.
Exiting GraphBin2... Bye...!

I checked the code and realized that:

contig_num = contigs_map_rev[int(re.search('%s(.*)%s' % (start, end), row[0]).group(1))]

...is expecting a bin.csv file with contigs simply labeled as:

NODE_1,1
NODE_2,1
NODE_3,1
NODE_4,2
NODE_5,2

...but spades names contigs as:

NODE_18_length_62406_cov_15.570288
NODE_37_length_46852_cov_20.727739
NODE_157_length_24733_cov_33.082097
NODE_241_length_18536_cov_12.750717
NODE_303_length_15717_cov_28.974141
NODE_351_length_14065_cov_26.651249
NODE_605_length_9174_cov_149.020726
NODE_669_length_8561_cov_15.148483
NODE_762_length_7725_cov_22.829726
NODE_773_length_7642_cov_3.858310

So do the contig names in the output of spades (contig fasta & assembly graph) need to be changed from NODE_\d+_length_\d+_cov_\d+\.\d+ to NODE_\d+, or do the names only need to be changed in the --binned input file?

Why not just parse the entire, original contig name:

contig_num = contigs_map_rev[int(re.search(r'%s(\d+)%s' % (start, end), row[0]).group(1))]  # capture only the digits after NODE_
# or better yet:
contig_num = contigs_map_rev[int(row[0].split('_')[1])]
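To illustrate the suggestion above, here is a minimal sketch of a parser that accepts both the truncated and the full SPAdes contig names. The helper name `spades_contig_number` is hypothetical and not part of GraphBin2; it only assumes that names start with `NODE_` followed by an integer.

```python
import re

# Hypothetical helper (not part of GraphBin2): match "NODE_<digits>" at the
# start of the name, followed by either "_" (full name) or the end of string.
_NODE_ID = re.compile(r"^NODE_(\d+)(?:_|$)")


def spades_contig_number(name: str) -> int:
    """Return the numeric ID from 'NODE_18' or 'NODE_18_length_..._cov_...'."""
    match = _NODE_ID.match(name)
    if match is None:
        # Fail loudly with the offending name instead of a generic message.
        raise ValueError(f"not a SPAdes-style contig name: {name!r}")
    return int(match.group(1))
```

Anchoring on the digits rather than on `(.*)` avoids passing the `length` and `cov` fields to `int()`, which is what fails with the full names.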

Also, a blanket except: with a generic error message and no traceback will make it hard for users to figure out what the problem is. Example from the code:

try:
    with open(contig_bins_file) as contig_bins:
        readCSV = csv.reader(contig_bins, delimiter=',')
        for row in readCSV:
            start = 'NODE_'
            end = ''
            contig_num = contigs_map_rev[int(re.search('%s(.*)%s' % (start, end), row[0]).group(1))]
            bin_num = int(row[1])-1
            bins[bin_num].append(contig_num)

except:
    print("\nPlease make sure that you have provided the correct assembler type and the correct path to the binning result file in the correct format.")
    print("Exiting GraphBin2... Bye...!")
    sys.exit(1)
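As a rough sketch of what more specific error handling could look like, the loop above might be restructured as follows. This is not GraphBin2's actual code: the function name `read_bins` and the `n_bins` parameter are hypothetical, and `contigs_map_rev` is assumed to be a dict-like mapping from contig number to contig index.

```python
import csv
import sys


def read_bins(contig_bins_file, contigs_map_rev, n_bins):
    """Hypothetical rewrite: report the file, line, and cause of any failure."""
    bins = [[] for _ in range(n_bins)]
    try:
        with open(contig_bins_file, newline="") as contig_bins:
            for line_no, row in enumerate(csv.reader(contig_bins), start=1):
                try:
                    # Works for both 'NODE_18' and 'NODE_18_length_..._cov_...'.
                    contig_id = int(row[0].split("_")[1])
                    contig_num = contigs_map_rev[contig_id]
                    bin_num = int(row[1]) - 1
                    if not 0 <= bin_num < n_bins:
                        raise ValueError(f"bin number {row[1]} out of range")
                    bins[bin_num].append(contig_num)
                except (IndexError, ValueError, KeyError) as err:
                    # Name the offending line rather than printing a generic hint.
                    sys.exit(
                        f"{contig_bins_file}, line {line_no}: "
                        f"cannot parse {row!r} ({err!r})"
                    )
    except OSError as err:
        # A missing or unreadable file is reported separately from bad content.
        sys.exit(f"Cannot read binning file {contig_bins_file}: {err}")
    return bins
```

Catching only the expected exception types keeps unrelated bugs visible with a full traceback, while the exit message tells the user which row of the binning file caused the failure.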

nick-youngblut commented 4 years ago

I tried converting the full contig names:

NODE_18_length_62406_cov_15.570288
NODE_37_length_46852_cov_20.727739
NODE_157_length_24733_cov_33.082097
...

...to the truncated version as specified in README:

NODE_1,1
NODE_2,1
NODE_3,1

This seems to have worked. I'm guessing that graphbin2 automatically handles the extra spades naming info in the contigs fasta and gfa; if so, why not in the --binned file as well?
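For anyone needing the same workaround, the conversion described above could be scripted roughly like this. The function name `truncate_bin_names` and the file paths are hypothetical; the sketch assumes a comma-separated file with the contig name in the first column.

```python
import csv
import re


def truncate_bin_names(src_csv, dst_csv):
    """Rewrite 'NODE_18_length_..._cov_...,1' rows as 'NODE_18,1'."""
    pattern = re.compile(r"^(NODE_\d+)")
    with open(src_csv, newline="") as src, open(dst_csv, "w", newline="") as dst:
        writer = csv.writer(dst, lineterminator="\n")
        for row in csv.reader(src):
            if not row:
                continue  # skip blank lines
            match = pattern.match(row[0])
            if match is None:
                raise ValueError(f"unexpected contig name: {row[0]!r}")
            # Keep the bin column(s) unchanged; only shorten the contig name.
            writer.writerow([match.group(1), *row[1:]])
```

Usage would be along the lines of `truncate_bin_names("bins_full.csv", "bins_truncated.csv")`, with the second file then passed to `--binned`.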

The output is also in the same truncated contig name format:

NODE_1,1
NODE_2,1
NODE_3,1

...which then affects downstream mapping of these node IDs to the contig fasta (e.g., when using DAS-Tool).

If graphbin2 requires the truncated node naming, then it would be helpful if it wrote a new version of the contig fasta with truncated names.
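In the meantime, a renamed fasta can be produced outside of graphbin2. The following is only a sketch under the assumption that every header starts with `>NODE_<number>`; the function name `truncate_fasta_headers` is hypothetical.

```python
import re


def truncate_fasta_headers(src_fasta, dst_fasta):
    """Copy a fasta file, shortening '>NODE_18_length_...' headers to '>NODE_18'."""
    pattern = re.compile(r"^>(NODE_\d+)")
    with open(src_fasta) as src, open(dst_fasta, "w") as dst:
        for line in src:
            if line.startswith(">"):
                match = pattern.match(line)
                if match is None:
                    raise ValueError(f"unexpected fasta header: {line.rstrip()!r}")
                line = f">{match.group(1)}\n"
            # Sequence lines are written through unchanged.
            dst.write(line)
```

The resulting file uses the same truncated IDs as the graphbin2 output, so downstream tools that join bins to sequences by name can match them.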

Vini2 commented 4 years ago

Hello @nick-youngblut,

Thanks for posting this issue. As mentioned in the input format section, the current version of GraphBin2 requires the user to input truncated contig ids as shown. However, I agree with you that it is better to let users input the original contig ids rather than the truncated ones. I will update the code accordingly to take in and output the original contig ids. Until then, I will leave this issue open.

Vini2 commented 3 years ago

Fixed the contig naming issue for the SPAdes version of GraphBin2. Users can now provide the original SPAdes contig names in the initial binning result, and the GraphBin2 output will also contain the original contig names.

Commit ID: 0f6f5a4677c4f7fa5989f556966526473308ae0d

Closing issue after fixing.