replikation / poreCov

SARS-CoV-2 workflow for nanopore sequence data
https://case-group.github.io/
GNU General Public License v3.0

Split fasta improvement #239

Closed DataSpott closed 1 year ago

DataSpott commented 1 year ago

Changes to split_fasta.py in the fasta-input:

  1. Fixed an issue where leading empty lines in the fasta-file caused an error that failed the whole pipeline -> empty lines are now skipped completely
  2. Changed behaviour regarding the fasta-header:
    • Before, the header was split at the first whitespace and only the first part was taken as the new fasta-header -> this could cause problems with fastas from e.g. GISAID that include whitespaces in their names (like "hcov19/Hong Kong/..."): they were detected as duplicates of the same sequence even when the sequences were different.
    • Now whitespaces are replaced with "_" -> can lead to longer file-names & fasta-headers, but the whole header information is preserved
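
To illustrate the two changes, here is a minimal sketch of the described behaviour. It is not the actual code from split_fasta.py: the function names `sanitize_header` and `split_fasta`, the record names in the example, and the choice to collapse a run of whitespace into a single "_" are all assumptions made for illustration.

```python
import re


def sanitize_header(header_line: str) -> str:
    """Return a whitespace-free record name from a raw '>' header line.

    Hypothetical helper: every run of whitespace becomes one '_', so the
    whole header is preserved instead of being cut at the first whitespace.
    """
    name = header_line[1:].strip()
    return re.sub(r"\s+", "_", name)


def split_fasta(lines):
    """Group FASTA lines into {record name: [sequence lines]}.

    Empty lines (leading, trailing, or in between) are skipped entirely.
    """
    records = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            # Skip empty lines instead of failing on them.
            continue
        if line.startswith(">"):
            current = sanitize_header(line)
            records.setdefault(current, [])
        elif current is not None:
            records[current].append(line)
    return records


# Hypothetical input: a leading empty line and two headers that share the
# same first word, which the old split-at-whitespace logic would have
# treated as duplicates named "sample".
example = [
    "",
    ">sample A 001",
    "ACGT",
    "TTGA",
    "",
    ">sample A 002",
    "GGCC",
]

records = split_fasta(example)
for name, seq_lines in records.items():
    print(name, "".join(seq_lines))
```

With this approach the two records stay distinct because the full header is kept, at the cost of longer names.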

Solves issue #229

replikation commented 1 year ago

Please run some "worst case" scenario tests, and also test with the default input (from fastq) to check that nothing bad happens in the final HTML. If everything is fine we can merge @DataSpott

DataSpott commented 1 year ago

Tested now with the "test_fastq" & "test_fasta" profiles as well as a fasta-file with two hard-cut fasta-sequences. It worked properly in all three cases.