loculus-project / loculus

An open-source software package to power microbial genomic databases
https://loculus.org
GNU Affero General Public License v3.0

Use `data_report.jsonl` instead of `dataformat tsv` in ingest pipeline #1428

Open corneliusroemer opened 6 months ago

corneliusroemer commented 6 months ago

The `data_report.jsonl` output is more structured and contains more information than the output of `dataformat tsv`.

For example, it provides authors as a list rather than as a concatenated string with no whitespace between individuals. It also includes a sequence hash field, which saves us from having to compute the hash ourselves, and the full taxonomic hierarchy.
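A rough sketch of what reading the JSONL directly could look like in the ingest step. The field names used here (`accession`, `authors`, `sequence_hash`, `lineage`) and the sample record are hypothetical placeholders, not the actual NCBI schema; the real keys would need to be checked against a real `data_report.jsonl`.

```python
import json
from typing import Iterable, Iterator


def parse_data_report(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one flattened record per non-empty JSONL line.

    All keys read from the record are hypothetical placeholders;
    substitute the real data report keys once confirmed.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        yield {
            "accession": record.get("accession"),
            # Authors arrive as a list, so no string splitting is needed.
            "authors": record.get("authors", []),
            # Reuse the provided hash instead of computing our own.
            "sequence_hash": record.get("sequence_hash"),
            # Flatten the taxonomic hierarchy into a list of names.
            "lineage": [t.get("name") for t in record.get("lineage", [])],
        }


# Made-up sample input standing in for the contents of data_report.jsonl.
SAMPLE_LINES = [
    json.dumps(
        {
            "accession": "XX000001.1",
            "authors": ["Doe,J.", "Roe,R."],
            "sequence_hash": "abc123",
            "lineage": [{"name": "Viruses"}, {"name": "Riboviria"}],
        }
    ),
    "",
]

records = list(parse_data_report(SAMPLE_LINES))
print(records[0]["authors"])
```

In the pipeline, `parse_data_report` would be fed an open file handle instead of `SAMPLE_LINES`, since iterating over a file yields one line at a time.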

corneliusroemer commented 3 weeks ago

An extra benefit of doing the conversion ourselves is that we'll avoid this type of bug in the future: https://github.com/ncbi/datasets/issues/404