virus-evolution / phylopipe

Pipeline to build a massive tree from global data
GNU General Public License v3.0

Investigate UK sequences missing `phylotype` assignment but containing `uk_lineage` #13

Closed rmcolq closed 3 years ago

rmcolq commented 3 years ago

Take UK422 as an example: it has 12403 rows in the dataset, but only 1240 of them appear to have a phylotype:

grep "UK422," /cephfs/covid/bham/results/phylogenetics/beta/metadata/cog_global_2021-05-31_public.csv | wc -l
12403
grep "UK422," /cephfs/covid/bham/results/phylogenetics/beta/metadata/cog_global_2021-05-31_public.csv | cut -f 17 -d , | grep "^_" | wc -l
1240

rmcolq commented 3 years ago

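First discovery: `cut` doesn't work with these CSVs because some fields contain commas. Those fields have quotation marks around them, so proper CSV parsers handle them correctly, but `cut -d ,` splits on every comma and so picks the wrong column on those rows.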
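
A quote-aware recount could look something like the sketch below. It is only an illustration: the column names `uk_lineage` and `phylotype` are taken from the issue title and assumed to be header names in the CSV, the `note` and `sequence_name` columns and the sample rows are made up, and it matches the lineage column exactly rather than grepping for `UK422,` anywhere in the line.

```python
import csv
import io


def count_phylotyped(handle, lineage, lineage_col="uk_lineage", phylotype_col="phylotype"):
    """Return (rows in the lineage, rows in the lineage with a non-empty phylotype)."""
    total = 0
    with_phylotype = 0
    for row in csv.DictReader(handle):  # csv respects quoted fields containing commas
        if row[lineage_col] != lineage:
            continue
        total += 1
        if row[phylotype_col].strip():
            with_phylotype += 1
    return total, with_phylotype


def naive_cut(text, field):
    """Mimic `cut -d , -f FIELD`: split on every comma, ignoring quotes."""
    out = []
    for line in text.splitlines():
        parts = line.split(",")
        out.append(parts[field - 1] if len(parts) >= field else "")
    return out


# Made-up sample; the quoted `note` field contains commas.
SAMPLE = (
    "sequence_name,note,uk_lineage,phylotype\n"
    'seq1,"Hospital A, ward 2",UK422,UK422_1\n'
    "seq2,plain,UK422,\n"
    'seq3,"x, y, z",UK5,UK5_1.1\n'
)

counts = count_phylotyped(io.StringIO(SAMPLE), "UK422")
cut_column = naive_cut(SAMPLE, 4)  # field 4 should be the phylotype
```

On the sample, the naive split returns the lineage instead of the phylotype for `seq1`, because the quoted comma shifts every later field by one. For a real file, open it with `open(path, newline="")` and pass the handle to `count_phylotyped`.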

rmcolq commented 3 years ago

Summary of problems found:

  1. Nextflow's built-in `collectFile` operator was misbehaving when combining the per-`uk_lineage` CSVs of phylotypes, because they weren't saved in Unix-flavoured mode. Clusterfunk was updated accordingly.
  2. Nextflow's built-in `splitCsv` operator was misbehaving and writing rows that had a `uk_lineage` to `.txt` instead of `UK??.txt`. It has been replaced by a Python script.
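
For illustration, a replacement for the split step might look roughly like the sketch below. This is not the script that was added to the pipeline. It assumes each output file holds one sequence name per line, that the name lives in a hypothetical `sequence_name` column, and that "Unix-flavoured" refers to `\n` line endings.

```python
import csv
import io
import os
import tempfile
from collections import defaultdict


def split_by_lineage(handle, outdir, lineage_col="uk_lineage", name_col="sequence_name"):
    """Write <lineage>.txt files (one name per line) and return the paths written."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        lineage = (row.get(lineage_col) or "").strip()
        if not lineage:
            # Skip rows without a lineage so no bare ".txt" file is created.
            continue
        groups[lineage].append(row[name_col])

    written = []
    for lineage, names in sorted(groups.items()):
        path = os.path.join(outdir, f"{lineage}.txt")
        # newline="\n" forces Unix line endings regardless of platform.
        with open(path, "w", newline="\n") as out:
            for name in names:
                out.write(name + "\n")
        written.append(path)
    return written


# Made-up input with Windows line endings and one row lacking a lineage.
SAMPLE_CRLF = (
    "sequence_name,uk_lineage\r\n"
    "seq1,UK422\r\n"
    "seq2,\r\n"
    "seq3,UK5\r\n"
    "seq4,UK422\r\n"
)

with tempfile.TemporaryDirectory() as demo_dir:
    demo_files = sorted(
        os.path.basename(p) for p in split_by_lineage(io.StringIO(SAMPLE_CRLF), demo_dir)
    )
```

Grouping in memory before writing keeps the file naming in one place, which makes it easy to check that every output name is a real lineage.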