viralemergence / virion

The Global Virome in One Network
https://viralemergence.github.io/virion

Collapse rows by NCBIAccession if everything else is identical? #62

Closed cjcarlson closed 2 years ago

cjcarlson commented 2 years ago

I haven't decided whether to do this yet. It makes some of the records more information-dense and adds a step to working with the accession records, but it also substantially reduces the number of rows occupied by the GenBank data, which probably means the size of the dataset is more truthful (but less impressive! oops!)

cjcarlson commented 2 years ago

It turns out these are actually only collapsed in the PREDICT workflow, so I'm going to go through and remove that step so the dataset is internally consistent. Collapsing the accessions can be a nice storage move, but it also means people have to expect the accession field to hold strings of values that they have to decompose if they want to query them programmatically. It's probably better and easier to just have a big, slow dataset.
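To make the trade-off concrete, here is a minimal sketch of what collapsing and decomposing could look like. It is not the VIRION code: the `Host` and `Virus` columns, the accession values, and the `", "` delimiter are all placeholders assumed for illustration, and only the `NCBIAccession` field name comes from this issue.

```python
from collections import defaultdict

SEP = ", "  # assumed delimiter; this thread does not say which one is used


def collapse_by_accession(rows, key="NCBIAccession", sep=SEP):
    """Merge rows that are identical in every field except `key`."""
    groups = defaultdict(list)
    for row in rows:
        # Every field other than the accession identifies the group.
        rest = tuple(sorted((k, v) for k, v in row.items() if k != key))
        groups[rest].append(row[key])
    # One output row per group, with its accessions joined into one string.
    return [{**dict(rest), key: sep.join(accs)} for rest, accs in groups.items()]


def expand_accessions(rows, key="NCBIAccession", sep=SEP):
    """Undo the collapse: emit one row per accession."""
    return [{**row, key: acc} for row in rows for acc in row[key].split(sep)]


# Placeholder records, not real VIRION data.
rows = [
    {"Host": "host a", "Virus": "virus x", "NCBIAccession": "ACC0001"},
    {"Host": "host a", "Virus": "virus x", "NCBIAccession": "ACC0002"},
    {"Host": "host b", "Virus": "virus x", "NCBIAccession": "ACC0003"},
]

collapsed = collapse_by_accession(rows)  # 2 rows instead of 3
expanded = expand_accessions(collapsed)  # back to one row per accession
```

Anyone who wants to query by a single accession has to run something like `expand_accessions` first, which is the extra step described above.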

cjcarlson commented 2 years ago

I've removed this from the PREDICT workflow, but in 240650cbc8b85ce6d79c1da72a47c6c36a92f745 I've actually decided to re-implement it in step 04, as the final compilation step, because it makes the dataset much, much easier to work with.