contig_id in any of the files cannot be mapped back to original FASTA

phac-nml / mob-suite

MOB-suite: Software tools for clustering, reconstruction and typing of plasmids from draft assemblies

Apache License 2.0

124 stars 33 forks source link

contig_id in any of the files cannot be mapped back to original FASTA #99

Open schorlton opened 2 years ago

schorlton commented 2 years ago

Thanks for the great tool and your help both on and offline.

I'm curious about the decision to fix_fasta_header before processing the FASTA. This makes it basically impossible (as far as I can tell) to combine MOB-Suite outputs with those of other tools dowstream, as it has irreversibly modified the actual contig IDs and then puts those out as new contig IDs in all of its outputs. Was this a conscious decision? Is it possible to leave the contig IDs as the FASTA standard which is the first word before a whitespace in the FASTA header?

It seems like another approach would be to parse the input FASTA file with biopython, maintaining separation between the contig ID and description. As far as I can tell, it seems that this is the only place that actually needs the sequence description: https://github.com/phac-nml/mob-suite/blob/1d735b30053b45457a59c277c8d996ab86e0347c/mob_suite/mob_recon.py#L1138-L1142

I'd be happy to open a PR if that makes sense to you? Thanks again!

jrober84 commented 2 years ago

So the reason for the fixing of fasta headers is that spaces and other characters can cause issues with tools like blast but there are a variety of tools that put information in the fasta headers delimited by spaces. Al of the blast calls requires mapping back to the contig identifiers and spaces cause issues. We were originally planning on stripping the headers entirely in favor of a sequential integer id but compromised on this since MOB-suite does use the circularity information and it would be harder to map things back. I can put this in as a feature request to have the original headers maintained in the outputs.

schorlton commented 2 years ago

Sorry for my digging further, but how does it cause issues with BLAST? As far as I'm aware, BLAST outputs contig ID as per FASTA standard in many/most of its output formats.

Thanks again, would really appreciate this addition!

jrober84 commented 2 years ago

The issue #87 shows the problem with super long headers. I have made a change to the code that all sequences will be renamed in the format {int}_{md5}_circular={status} for all blast and mash searches. Then the program will relabel all of the sequences to their original contig id's. This should solve your issue, as I agree there is a problem with mapping contigs back for integrating results from other tools. version 3.1.0 will implement this

schorlton commented 2 years ago

Thanks @jrober84 for the effort! One suggestion: I'd suggest putting the sequence ID in the contig_id field, and not the whole FASTA header. Eg. with this fix it puts contig_1 circular=true in the contig_id, but I'd suggest just contig_1. There are a couple reasons for this:

The field is named ID, which is only the first field of a FASTA header by convention
The field may get excessively long in the TSV
More importantly, if someone puts a tab in their FASTA header, it will break mob-recon's output as there will varying numbers of columns

kbessonov1984 commented 2 years ago

Thanks for sharing reflections on contig_id format. I think that both providing a full length fasta header as contig ID or just using identifier before the first space has its value because this way user might encode additional information that might be of use. On the other hand the contig_id might get long to read, but the output should not be affected by length of presence of tabs and correctly be written to the report file by the writeReport(data_list, header, outfile) function. All subsequent tools can just remove this information if it causes compatibility issues. It is easier to remove data than to add in my opinion. In addition, what if user decides to remove all spaces and substitute them by _ symbols. Then in this case we would not able to trim the contig_id. I think the best compromise is to create new output parameter --contig_id_standardize that would output standard contig_ids truncated to the first space (if present).

schorlton commented 2 years ago

Thanks @kbessonov1984. While the output may be "correctly" written to TSV by python, it won't be parseable as it will have variable numbers of columns depending on if a FASTA sequence had tabs in its sequence header. Eg. I don't think you'll be able to read the mob-recon output back into pandas or us python's csv.DictReader, which is pretty problematic.