nextstrain / ncov-ingest

A pipeline that ingests SARS-CoV-2 (i.e. nCoV) data from GISAID and GenBank, transforms it, stores it on S3, and triggers Nextstrain nCoV rebuilds.
MIT License

Ingest VirusSeq data #421

Open chaoran-chen opened 7 months ago

chaoran-chen commented 7 months ago

VirusSeq is a Canadian data portal that hosts over 500,000 SARS-CoV-2 sequences that are not on GenBank. It would be amazing if they could be incorporated into the ncov open dataset.

I talked to Sally Otto and Justin Jia (@bfjia) who work on VirusSeq and they support this. Users of sequences from VirusSeq should properly acknowledge the data generators. The policy is explained on this website and @bfjia can provide further information.

Once the data are in the open dataset, they can be ingested into LAPIS open. Sally and Justin are interested in subsequently fetching data from LAPIS, for example for Duotang.

tsibley commented 5 months ago

Unfortunately, it seems (to me at least) that the usage policy precludes incorporation in our Open dataset. The policy is closer to GISAID's (requires acknowledgement and co-operation with submitters) than INSDC's (no restrictions). We wouldn't be able to meet or pass along those requirements once the data are incorporated into the Open dataset. (Obviously we strongly encourage acknowledgement and co-operation with data generators/submitters regardless of data source, but there's a difference between expecting courtesy and requiring it.)

Presumably these usage restrictions are why the dataset is not part of INSDC already.

bfjia commented 5 months ago

@tsibley I think we can address this open data policy issue. I will raise it at our next biweekly meeting and provide you with an update shortly after. Would this link (https://www.insdc.org/policy/) be a good summary of the policy that I can forward to the VirusSeq team? Thank you.

tsibley commented 5 months ago

@bfjia Ah, excellent. The INSDC policy you link to would work, but there are other options too, such as a CC-BY license (with attribution to CanCOGeN VirusSeq rather than individual submitters).