Open FernandoDuarteF opened 1 day ago
It doesn't work on assets/samplesheet.csv, same SEPP error as in #37. I think it's safe to say that this is a dead end.
Very important to note that this problem is related to the --auto-lineage
mode, and doesn't appear when the lineage is set.
I would say that the only options right now are:
- --auto-lineage works fine on the samplesheet.csv, although it's not assured it will work on other assemblies.
- --lineage parameter for all assemblies (e.g. all assemblies come from the same lineage), but we can also add another column in the input samplesheet to set the lineage for each assembly (e.g. all or some assemblies come from different lineages).
I also tried flattening multi-line sequences into a single line, as well as checking with different sequence headers, but that didn't solve the issue.
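To make the per-assembly lineage idea a bit more concrete, here is a minimal sketch of how an extra samplesheet column could be turned into BUSCO arguments. The column names (sample, fasta, lineage), the function names and the example rows are all hypothetical, not what the pipeline currently uses. It assumes BUSCO's --lineage_dataset and --auto-lineage options, and falls back to --auto-lineage only when no lineage is given anywhere.

```python
import csv
import io


def busco_lineage_args(row, default_lineage=None):
    """Return the BUSCO lineage arguments for one samplesheet row.

    Priority: per-row 'lineage' column (hypothetical name), then a
    pipeline-wide default, then --auto-lineage as the last resort.
    """
    lineage = (row.get("lineage") or "").strip() or default_lineage
    if lineage:
        return ["--lineage_dataset", lineage]
    return ["--auto-lineage"]


def read_samplesheet(handle, default_lineage=None):
    """Map each sample name to its BUSCO lineage arguments."""
    reader = csv.DictReader(handle)
    return {
        row["sample"]: busco_lineage_args(row, default_lineage)
        for row in reader
    }


# Illustrative samplesheet: one assembly with a lineage set, one without.
example = io.StringIO(
    "sample,fasta,lineage\n"
    "Vespa_velutina,Vespa_velutina.prot.fa,hymenoptera_odb10\n"
    "unknown_sp,unknown_sp.prot.fa,\n"
)
print(read_samplesheet(example))
```

With this layout a single global lineage and a per-assembly lineage can coexist: an empty cell just inherits the default, so existing samplesheets without the column would keep working.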
I tried adding fastavalidator, but it doesn't help. It throws the same error for the outputs of both AGAT longest isoform and the local perl script:
Vespa_velutina.prot.fa.largestIsoform.fa has non-sequence characters in it
Considering it only checks for [A-Za-z], I'm pretty sure it's referring to the "*" in the protein sequences.
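If the "*" really is what the validator is complaining about, a quick way to test that would be to strip the asterisks before validating. Below is a rough sketch, not something already in the pipeline: the function names are made up, and it assumes "*" only marks stop codons. It flattens multi-line records and removes trailing "*" by default; removing internal ones is optional because that changes the sequence.

```python
import re


def only_letters(seq):
    """Mimic a validator that accepts nothing but [A-Za-z]."""
    return re.fullmatch(r"[A-Za-z]+", seq) is not None


def clean_protein_fasta(text, strip_internal=False):
    """Flatten multi-line FASTA records and remove stop-codon asterisks.

    Trailing '*' is always removed; internal '*' only when
    strip_internal=True. Lines before the first header are ignored.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))

    out = []
    for header, seq in records:
        seq = seq.rstrip("*")
        if strip_internal:
            seq = seq.replace("*", "")
        out.append(f"{header}\n{seq}\n")
    return "".join(out)


# Toy input: one record ending in '*', one with an internal '*' as well.
fasta = ">p1 desc\nMKT\nLLV*\n>p2\nMA*GK*\n"
print(clean_protein_fasta(fasta), end="")
```

If the cleaned file passes fastavalidator but BUSCO still fails with the SEPP error, the "*" can be ruled out as the cause of the original problem.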
I was checking the BUSCO phylogenomic tree protocol, and I noticed that they use the agat_sp_extract_sequences.pl script from AGAT to extract the protein sequences from a filtered GFF. Might be worth trying.
This issue is related to #37.