Closed AlbertRockG closed 1 year ago
With a little testing, there's a semantic change between faSomeRecords
and seqkit grep -n
. The first matches only on id (e.g. a FASTA definition line is >id description more description
); the second matches on the id + description.
@tsibley thank you for your review. When I noticed that the number of records exceed the number of entries in strains_africa.txt
file, I check the content of the sequences_africa.fasta
file and found sequences from South Korea
. Kindly find below screenshots of my terminal:
From that observation, I suspect that faSomeRecords
would have difficulty processing the id of sequences from South Africa
, confusing them with sequences from South Korea
.
I tried seqkit grep -n
and found it more efficient.
With a little testing, there's a semantic change between
faSomeRecords
andseqkit grep -n
. The first matches only on id (e.g. a FASTA definition line is>id description more description
); the second matches on the id + description.
As the ids are formatted by the provided script, I don't think this change in semantics matters. My idea is that faSomeRecords
might have trouble processing ids that contain spaces. Thanks.
Ah, that extra info is very helpful. I agree we do want to switch to seqkit grep -n
.
The bug you're seeing is indeed due to the difference in FASTA between a sequence's id and it's full name/description. The id is the part between the >
and the first space, e.g. in >x y z
the id is x
and the full name/description is x y z
.
faSomeRecords
only matches on the id, which is why filtering for hCov-19/South Africa
and hCov-19/South Korea
will yield the same results: it only matches on everything before the first space, or hCov-19/South
. This causes the bug, because this workflow's data/strains_africa.txt
file contains the full name/description for each sequence (as id isn't unique in this data).
seqkit grep
also defaults to matching on the id just like faSomeRecords
, but the -n
(or --by-name
) option changes it to match on the full name/description. As you suggested with your bug fix, that's indeed what we want.
Thanks for this! I'll cherry-pick your commit and merge.
@AlbertRockG I've cherry-picked your fix and amended it a bit (for some further adjustments to the README and the commit message) to make https://github.com/nextstrain/ncov-africa-cdc/pull/8. That's merged now. Thanks again for this bug report and fix!
Description of proposed changes
I am opening this pull request to suggest the use of
seqkit grep
instead offaSomeRecords
to extract Africans fasta records from thesequences_fasta.tar.xz
file downloaded from GISAID.Related issue(s)
I am making this suggestion because I noticed recently that the number of fasta records in the
sequences_africa.fasta
file obtained exceeds the number of entries in thestrains_africa.txt
file.Testing