Strange results with fasta input

jtangrot commented 1 year ago

Description of the bug

When using fasta input where the description/header line contains more than just the sequence name, the taxonomy files do not look as expected. The whole description line is used instead of just the sequence name, and if the description contains a tab character, the sequence is placed on a line of its own instead of in the sequence column in e.g. ASV_tax.tsv.

Command used and terminal output

No response

Relevant files

No response

System information

No response

d4straub commented 1 year ago

That indeed seems imperfect. What would be your preferred solution? Modify the fasta file that was provided by --input appropriately, i.e. drop all text after the sequence name (could be done in an additional process)?

jtangrot commented 1 year ago

I was thinking to (i) add a line in modules/local/dada2_taxonomy.nf to remove anything but sequence name, and (ii) fix a bug in bin/add_full_sequence_to_taxfile.py that was supposed to remove anything but sequence name but currently fails. But maybe your suggestion to modify the input fasta is better/more general?
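To make option (ii) concrete, here is a minimal sketch of reducing a fasta header to the sequence name only. It is not the code in bin/add_full_sequence_to_taxfile.py; the function names `sequence_name` and `strip_descriptions` are hypothetical, and the sketch assumes the sequence name is the first whitespace-delimited token of the header, so that both space- and tab-separated descriptions are dropped.

```python
def sequence_name(header_line: str) -> str:
    """Return the first whitespace-delimited token of a fasta header line.

    Assumption: the sequence name is everything between the leading '>'
    and the first space or tab.
    """
    text = header_line[1:] if header_line.startswith(">") else header_line
    # str.split() without arguments splits on any run of whitespace,
    # so a tab-separated description is handled like a space-separated one.
    parts = text.split()
    return parts[0] if parts else ""


def strip_descriptions(fasta_text: str) -> str:
    """Rewrite every header so that only the sequence name remains."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(">" + sequence_name(line))
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

Splitting on any whitespace rather than on a single space is the point here: a header such as `>ASV_1<TAB>description` would otherwise keep the tab and the description.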

erikrikarddaniel commented 1 year ago

I'm not sure I understand all details, but personally I don't like programs that remove everything after the first space -- often that includes organism names -- so if we can avoid that it would be good. A test case would be nice.

jtangrot commented 1 year ago

Agreed, but in this case, when Ampliseq is used to assign taxonomies, is it essential to keep organism names in the original fasta file? Also, keeping the complete header line might result in quite long ASV_IDs in e.g. ASV_tax_species.tsv and other taxonomy files. A simple test case is to take a small fasta containing more than just the name in the header lines, run ampliseq, and then check dada2/ASV_tax_species.tsv. The output looks especially strange if the name is separated from the description by a tab instead of a space. An example with PacBio reads is attached: ASV_seqs.header.small.fasta.gz
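For anyone who wants to reproduce this without the attachment, a small problematic fasta can be generated along these lines. This is only a sketch: the records, organism name and function names are made up for illustration and are not taken from the attached file.

```python
from pathlib import Path

# Hypothetical records: (header text, sequence). The first header uses a
# space before the description, the second a tab, the third has a name only.
TEST_RECORDS = [
    ("ASV_1 Escherichia coli strain X", "ACGTACGTAC"),
    ("ASV_2\tdescription after a tab", "TTGACCATGA"),
    ("ASV_3", "GGCATTCGTA"),
]


def write_test_fasta(path):
    """Write TEST_RECORDS as a fasta file, keeping the headers untouched."""
    lines = []
    for header, seq in TEST_RECORDS:
        lines.append(">" + header)
        lines.append(seq)
    Path(path).write_text("\n".join(lines) + "\n")


def read_headers(path):
    """Return all header lines of a fasta file without the leading '>'."""
    text = Path(path).read_text()
    return [line[1:] for line in text.splitlines() if line.startswith(">")]
```

The resulting file can then be passed to the pipeline via --input, and the taxonomy tables inspected for the tab-separated header in particular.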

erikrikarddaniel commented 1 year ago

> Agreed, but in this case when Ampliseq is used to assign taxonomies - is it essential to keep organism names in the original fasta file?

"Essential" perhaps not, but it would be good if possible. Since this is a user-specified file, it might contain any information the user would like to keep.

> Also, keeping the complete header line might result in quite long ASV_IDs in e.g. ASV_tax_species.tsv and other taxonomy files.

That's a fair point, of course. In my opinion that would be up to the user of the pipeline, though.

But the most important thing is to keep everything stable and smooth, so if cutting everything after the first space is the best way, no problem.

> A simple test case is to take a small fasta containing more than just the name in the header lines, run ampliseq, and then check dada2/ASV_tax_species.tsv. The output looks especially strange if the name is separated from the description by a tab instead of a space. An example with PacBio reads is attached: ASV_seqs.header.small.fasta.gz

Wouldn't that be good to have in the pipeline tests? I think replacing the one we have with a problematic one would be fine.

d4straub commented 1 year ago

If you think it's the user's task to provide a suitable file, then it might be sufficient to update the documentation? At least then the user is empowered to follow rules to come to an appropriate result. And/or convert all upsetting characters into e.g. underscores or spaces, so that the worst consequences of "bad" user input are mitigated, of course ideally also documenting that ;)

erikrikarddaniel commented 1 year ago

> If you think it's the user's task to provide a suitable file, then it might be sufficient to update the documentation? At least then the user is empowered to follow rules to come to an appropriate result. And/or convert all upsetting characters into e.g. underscores or spaces, so that the worst consequences of "bad" user input are mitigated, of course ideally also documenting that ;)

I would say yes to the latter, i.e. convert characters we know don't work. Note however that this is not a big deal for me so I'm happy with anything you two decide!
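As an illustration of the "convert characters" option, a sketch could look like the following. It is not necessarily what was implemented in the pipeline; the name `sanitize_header` is hypothetical, and the choice of which characters count as problematic (here tabs and other ASCII control characters) is an assumption based on the tab issue described above.

```python
import re

# Assumption: tabs and other ASCII control characters are the ones that
# break the tab-separated taxonomy tables; spaces are left untouched.
PROBLEM_CHARS = re.compile(r"[\x00-\x1f\x7f]")


def sanitize_header(header_line: str, replacement: str = "_") -> str:
    """Replace problematic characters in a fasta header line.

    The trailing line ending is removed first so that it is not
    converted along with the other control characters.
    """
    return PROBLEM_CHARS.sub(replacement, header_line.rstrip("\r\n"))
```

With the default replacement, `>ASV_1<TAB>Escherichia coli` becomes `>ASV_1_Escherichia coli`, so the full description is kept while the tab can no longer split the line into extra columns. Passing `replacement=" "` would convert to spaces instead.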

d4straub commented 1 year ago

So that is solved, right? If yes, please close.

jtangrot commented 1 year ago

Fixed in #544

nf-core / ampliseq