Open schorlton opened 2 years ago
So the reason for the fixing of fasta headers is that spaces and other characters can cause issues with tools like blast but there are a variety of tools that put information in the fasta headers delimited by spaces. Al of the blast calls requires mapping back to the contig identifiers and spaces cause issues. We were originally planning on stripping the headers entirely in favor of a sequential integer id but compromised on this since MOB-suite does use the circularity information and it would be harder to map things back. I can put this in as a feature request to have the original headers maintained in the outputs.
Sorry for my digging further, but how does it cause issues with BLAST? As far as I'm aware, BLAST outputs contig ID as per FASTA standard in many/most of its output formats.
Thanks again, would really appreciate this addition!
The issue #87 shows the problem with super long headers. I have made a change to the code that all sequences will be renamed in the format {int}_{md5}_circular={status} for all blast and mash searches. Then the program will relabel all of the sequences to their original contig id's. This should solve your issue, as I agree there is a problem with mapping contigs back for integrating results from other tools. version 3.1.0 will implement this
Thanks @jrober84 for the effort! One suggestion: I'd suggest putting the sequence ID in the contig_id
field, and not the whole FASTA header. Eg. with this fix it puts contig_1 circular=true
in the contig_id
, but I'd suggest just contig_1
. There are a couple reasons for this:
Thanks for sharing reflections on contig_id
format. I think that both providing a full length fasta header as contig ID or just using identifier before the first space has its value because this way user might encode additional information that might be of use. On the other hand the contig_id
might get long to read, but the output should not be affected by length of presence of tabs and correctly be written to the report file by the writeReport(data_list, header, outfile)
function. All subsequent tools can just remove this information if it causes compatibility issues. It is easier to remove data than to add in my opinion. In addition, what if user decides to remove all spaces and substitute them by _
symbols. Then in this case we would not able to trim the contig_id
. I think the best compromise is to create new output parameter --contig_id_standardize
that would output standard contig_ids
truncated to the first space (if present).
Thanks @kbessonov1984. While the output may be "correctly" written to TSV by python, it won't be parseable as it will have variable numbers of columns depending on if a FASTA sequence had tabs in its sequence header. Eg. I don't think you'll be able to read the mob-recon output back into pandas or us python's csv.DictReader
, which is pretty problematic.
Thanks for the great tool and your help both on and offline.
I'm curious about the decision to
fix_fasta_header
before processing the FASTA. This makes it basically impossible (as far as I can tell) to combine MOB-Suite outputs with those of other tools dowstream, as it has irreversibly modified the actual contig IDs and then puts those out as new contig IDs in all of its outputs. Was this a conscious decision? Is it possible to leave the contig IDs as the FASTA standard which is the first word before a whitespace in the FASTA header?It seems like another approach would be to parse the input FASTA file with biopython, maintaining separation between the contig ID and description. As far as I can tell, it seems that this is the only place that actually needs the sequence description: https://github.com/phac-nml/mob-suite/blob/1d735b30053b45457a59c277c8d996ab86e0347c/mob_suite/mob_recon.py#L1138-L1142
I'd be happy to open a PR if that makes sense to you? Thanks again!