vector-engineering / covidcg

A COVID-19 CoV Genetics (CG) browser to inform therapeutics development
https://covidcg.org
MIT License

what are the input files from gisaid required for covidcg to work #251

Closed philt31 closed 3 years ago

philt31 commented 3 years ago

Hello, could someone fix the link describing which data files are required from GISAID for covidcg to work? There is a link called "data files" in the analysis pipeline section, pointing to the data requirements (https://github.com/vector-engineering/covidcg#Data-Requirements), but it doesn't seem to work (i.e., it doesn't tell you what input files you need). Thanks, Phil

atc3 commented 3 years ago

Hi Phil,

I apologize, the pipeline is not working for users other than us currently. We recently switched over from manually downloaded files to a custom data feed provided by GISAID (as referenced by the data_feed_credentials and data_feed_url variables). Unfortunately, as per our agreement with GISAID, we are not permitted to share these credentials.

I am currently working on updating the Snakemake pipeline to accept more generalized inputs (i.e., from GISAID, GenBank or in-house sequences). In the meantime, an earlier version of COVID-CG at v1.2.0 (https://github.com/vector-engineering/covidcg/tree/v1.2.0) has a pipeline which accepts inputs as separate fasta files and metadata tsv files. GISAID no longer allows for downloads of sequence metadata in bulk, but maybe this earlier version can help. You will just have to structure your own internal metadata files to look like the GISAID ones:

patient_meta.tsv file (tab-separated):

| Virus name | Accession ID | Collection date | Location | Host | Additional location information | Gender | Patient age | Patient status | Passage | Specimen | Additional host information | Lineage | Clade |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| hCoV-19/.../.../2020 | EPI_ISL_.... | 2020-01-29 | Region / Country / Division / Location | Human | | Male | 21 | unknown | Original | oropharyngeal swab | "" | B | L |

seq_meta.tsv file (tab-separated):

| Virus name | Accession ID | Collection date | Location | Host | Passage | Specimen | Additional host information | Sequencing technology | Assembly method | Comment | Comment type | Lineage | Clade |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| hCoV-19/.../.../2020 | EPI_ISL_… | 2020-01-29 | Region / Country / Division / Location | Human | Original | oropharyngeal swab | | Sanger, Nanopore | Sequencher 5.4.6, minimap 2.17 | "" | "" | B | L |

I understand that some of the data is redundant between the two files, but the old pipeline is not very flexible and expects all of these fields to be present.
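For anyone building these files from in-house data, here is a minimal Python sketch that writes a tab-separated file with the column headers listed above and checks an existing file for missing columns. The column names and their order are copied from the two headers in this comment; the helper names (`write_meta_tsv`, `missing_columns`) and the example record values are hypothetical and are not part of covidcg. Whether the v1.2.0 pipeline needs anything beyond these headers has not been verified here.

```python
import csv
import io

# Column order copied from the patient_meta.tsv header above.
PATIENT_META_COLUMNS = [
    "Virus name", "Accession ID", "Collection date", "Location", "Host",
    "Additional location information", "Gender", "Patient age",
    "Patient status", "Passage", "Specimen", "Additional host information",
    "Lineage", "Clade",
]

# Column order copied from the seq_meta.tsv header above.
SEQ_META_COLUMNS = [
    "Virus name", "Accession ID", "Collection date", "Location", "Host",
    "Passage", "Specimen", "Additional host information",
    "Sequencing technology", "Assembly method", "Comment", "Comment type",
    "Lineage", "Clade",
]


def write_meta_tsv(handle, columns, records):
    """Write dict records as TSV; fields missing from a record are left empty."""
    writer = csv.DictWriter(
        handle,
        fieldnames=columns,
        delimiter="\t",
        restval="",             # fill absent fields with an empty string
        extrasaction="raise",   # reject keys that are not known columns
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)


def missing_columns(handle, columns):
    """Return the expected columns that are absent from a TSV header row."""
    header = next(csv.reader(handle, delimiter="\t"), [])
    return [name for name in columns if name not in header]


if __name__ == "__main__":
    # Hypothetical in-house record; all values are placeholders.
    record = {
        "Virus name": "hCoV-19/Example/LAB-0001/2020",
        "Accession ID": "INHOUSE_0001",
        "Collection date": "2020-01-29",
        "Location": "Region / Country / Division / Location",
        "Host": "Human",
    }
    buffer = io.StringIO()
    write_meta_tsv(buffer, PATIENT_META_COLUMNS, [record])
    print(buffer.getvalue())
```

Writing through `csv.DictWriter` keeps every row at the full column count even when a field is unknown, which matters because the old pipeline expects all fields to be present.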

Sorry for all the trouble. Like I mentioned earlier, we're currently designing a better pipeline that can ingest data from a variety of sources. I will let you know as soon as that version is ready.

Thanks, and hope this helps, Albert

atc3 commented 3 years ago

Hi Phil,

I've pushed some changes (from #255) that generalize our pipeline to user-defined datasets. As an example, I've written a workflow in which data is downloaded from GenBank, ingested, and cleaned. Details and instructions on this new workflow, and on how to customize it, are in the main README and in the ingest-specific README: workflow_genbank_ingest/README.

In summary, I've divided our pipeline into two steps: 1) ingestion and 2) main analysis. The ingestion step is input-specific and customizable (options are defined in config/config_[workflow].yaml). To process data from sources other than GenBank/GISAID, you will have to write your own workflow and config files. To start, you can just copy the GenBank workflow and adapt it. The main analysis then takes the ingested data, along with the configuration options, and builds the final data package. The configuration options are also passed to the front-end JS.
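To illustrate the two-step split, here is a rough Python sketch of the idea: each source has its own ingest function that cleans raw rows into one common record shape, and a single main analysis consumes those records together with the config. This is only a conceptual model. The real pipeline is a Snakemake workflow driven by YAML config files, and every name below (`IngestedRecord`, `ingest_genbank_like`, `ingest_in_house`, `main_analysis`, the raw field names, and the `min_sequence_length` option) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class IngestedRecord:
    """Common shape that every ingest step is assumed to produce."""
    accession_id: str
    collection_date: str
    location: str
    sequence: str


def ingest_genbank_like(raw_rows: Iterable[dict]) -> List[IngestedRecord]:
    """Hypothetical ingest for rows keyed by accession/collected/geo/sequence."""
    records = []
    for row in raw_rows:
        sequence = row.get("sequence", "").strip().upper()
        if not sequence:
            continue  # cleaning: skip entries that carry no sequence
        records.append(
            IngestedRecord(row["accession"], row["collected"], row["geo"], sequence)
        )
    return records


def ingest_in_house(raw_rows: Iterable[dict]) -> List[IngestedRecord]:
    """Hypothetical ingest for a lab's own rows, which use other field names."""
    return [
        IngestedRecord(row["id"], row["date"], row["site"], row["seq"].strip().upper())
        for row in raw_rows
        if row.get("seq", "").strip()
    ]


# Step 1 is chosen per data source; step 2 is shared by all sources.
INGEST_WORKFLOWS: Dict[str, Callable[[Iterable[dict]], List[IngestedRecord]]] = {
    "genbank": ingest_genbank_like,
    "in_house": ingest_in_house,
}


def main_analysis(records: List[IngestedRecord], config: dict) -> dict:
    """Build a toy 'data package' from ingested records and config options."""
    min_length = config.get("min_sequence_length", 0)  # hypothetical option
    kept = [r for r in records if len(r.sequence) >= min_length]
    return {
        "config": config,  # the same options could also be handed to a front end
        "num_sequences": len(kept),
        "accessions": [r.accession_id for r in kept],
    }


def run_pipeline(workflow: str, raw_rows: Iterable[dict], config: dict) -> dict:
    """Run the source-specific ingest, then the shared main analysis."""
    ingested = INGEST_WORKFLOWS[workflow](raw_rows)
    return main_analysis(ingested, config)
```

Under this model, supporting a new data source means adding one more ingest function (in the real pipeline, a new workflow plus config file) while the main analysis stays unchanged.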

Please let me know if this helps, and I'd be happy to assist in writing another workflow like I've done with the GenBank ingest.

Albert