Closed: philt31 closed this issue 3 years ago
Hi Phil,
I apologize; the pipeline currently does not work for users other than us. We recently switched from manually downloaded files to a custom data feed provided by GISAID (as referenced by the data_feed_credentials and data_feed_url variables). Unfortunately, under our agreement with GISAID, we are not permitted to share these credentials.
I am currently working on updating the Snakemake pipeline to accept more generalized inputs (i.e., from GISAID, GenBank, or in-house sequences). In the meantime, an earlier version of COVID-CG at v1.2.0 (https://github.com/vector-engineering/covidcg/tree/v1.2.0) has a pipeline that accepts inputs as separate FASTA files and metadata TSV files. GISAID no longer allows bulk downloads of sequence metadata, but maybe this earlier version can help. You will just have to structure your own internal metadata files to look like the GISAID ones:
patient_meta.tsv file (tab-separated; fields are shown here separated by " | ", and an empty field is left blank):
Virus name | Accession ID | Collection date | Location | Host | Additional location information | Gender | Patient age | Patient status | Passage | Specimen | Additional host information | Lineage | Clade
hCoV-19/.../.../2020 | EPI_ISL_.... | 2020-01-29 | Region / Country / Division / Location | Human |  | Male | 21 | unknown | Original | oropharyngeal swab | "" | B | L
seq_meta.tsv file (tab-separated; fields are shown here separated by " | ", and an empty field is left blank):
Virus name | Accession ID | Collection date | Location | Host | Passage | Specimen | Additional host information | Sequencing technology | Assembly method | Comment | Comment type | Lineage | Clade
hCoV-19/.../.../2020 | EPI_ISL_… | 2020-01-29 | Region / Country / Division / Location | Human | Original | oropharyngeal swab |  | Sanger, Nanopore | Sequencher 5.4.6, minimap 2.17 | "" | "" | B | L
I understand that some of the data is redundant, but the old pipeline is not very flexible and expects all of these fields to be present.
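If it helps, here is a minimal Python sketch of how in-house metadata could be written out in this layout. It is not part of the COVID-CG pipeline: the function name write_meta_tsv, the sample record, and the accession ID "INHOUSE_0001" are assumptions made up for illustration. Only the column names and their order are taken from the example headers above.

```python
import csv
import io

# Column order copied from the example headers in this thread.
PATIENT_META_COLUMNS = [
    "Virus name", "Accession ID", "Collection date", "Location", "Host",
    "Additional location information", "Gender", "Patient age",
    "Patient status", "Passage", "Specimen",
    "Additional host information", "Lineage", "Clade",
]
SEQ_META_COLUMNS = [
    "Virus name", "Accession ID", "Collection date", "Location", "Host",
    "Passage", "Specimen", "Additional host information",
    "Sequencing technology", "Assembly method", "Comment", "Comment type",
    "Lineage", "Clade",
]


def write_meta_tsv(records, columns, handle):
    """Write dict records as a tab-separated table with a header row.

    Hypothetical helper: fields missing from a record are written as empty
    strings, and keys that are not in `columns` are ignored, so every row
    has exactly one value per expected column.
    """
    writer = csv.DictWriter(
        handle,
        fieldnames=columns,
        delimiter="\t",
        restval="",
        extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)


# Hypothetical in-house record; the values mirror the example rows above.
sample = {
    "Virus name": "hCoV-19/.../.../2020",
    "Accession ID": "INHOUSE_0001",  # assumed placeholder ID
    "Collection date": "2020-01-29",
    "Location": "Region / Country / Division / Location",
    "Host": "Human",
    "Gender": "Male",
    "Patient age": "21",
    "Patient status": "unknown",
    "Passage": "Original",
    "Specimen": "oropharyngeal swab",
    "Sequencing technology": "Sanger, Nanopore",
    "Assembly method": "Sequencher 5.4.6, minimap 2.17",
    "Lineage": "B",
    "Clade": "L",
}

# The same record can feed both files; each writer keeps only its own columns.
patient_buf, seq_buf = io.StringIO(), io.StringIO()
write_meta_tsv([sample], PATIENT_META_COLUMNS, patient_buf)
write_meta_tsv([sample], SEQ_META_COLUMNS, seq_buf)
print(patient_buf.getvalue())
print(seq_buf.getvalue())
```

To write real files, you would pass handles opened with open("patient_meta.tsv", "w", newline="") and open("seq_meta.tsv", "w", newline="") in place of the in-memory buffers.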
Sorry for all the trouble. Like I mentioned earlier, we're currently designing a better pipeline that can ingest data from a variety of sources. I will let you know as soon as that version is ready.
Thanks, and hope this helps, Albert
Hi Phil,
I've pushed some changes (from #255) that generalize our pipeline to user-defined datasets. As an example, I've written a workflow where data is downloaded, ingested, and cleaned from GenBank. Details and instructions on this new workflow and how to customize it are in the main README and also in the ingest-specific README: workflow_genbank_ingest/README.
In summary, I've divided our pipeline into two steps: 1) ingestion and 2) main analysis. The ingestion step is input-specific and customizable (options are defined in config/config_[workflow].yaml). To process data from sources other than GenBank/GISAID, you will have to write your own workflow and config files. You can copy the GenBank workflow and adapt it as a starting point. The main analysis then takes the ingested data, along with the configuration options, and builds the final data package. The configuration options are also passed to the front-end JS.
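To make the split concrete, here is a rough Python sketch of the idea. It is only an illustration of the two-step structure described above, not code from the repository: the function names, the record fields, and the min_sequence_length option are all hypothetical.

```python
# Hypothetical sketch of the ingest / main-analysis split.
# None of these names come from the COVID-CG codebase.


def ingest_inhouse(raw_rows):
    """Input-specific step: map in-house field names onto a common schema."""
    for row in raw_rows:
        yield {
            "accession_id": row["sample_id"],
            "collection_date": row["collected"],
            "location": row["site"],
            "sequence": row["seq"].upper(),
        }


def main_analysis(records, config):
    """Input-agnostic step: combine ingested records with config options."""
    min_len = config["min_sequence_length"]  # assumed option name
    kept = [r for r in records if len(r["sequence"]) >= min_len]
    # The config travels with the data package so a front end could read it.
    return {"config": config, "n_sequences": len(kept), "records": kept}


# Made-up in-house rows and config values, for illustration only.
raw = [
    {"sample_id": "LAB-001", "collected": "2020-01-29",
     "site": "Region / Country / Division / Location", "seq": "acgtacgt"},
    {"sample_id": "LAB-002", "collected": "2020-02-03",
     "site": "Region / Country / Division / Location", "seq": "acg"},
]
config = {"min_sequence_length": 5}

package = main_analysis(ingest_inhouse(raw), config)
print(package["n_sequences"])
```

Only the ingest function knows about the source's field names, so supporting a new data source means writing a new ingest step while the main analysis stays unchanged.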
Please let me know if this helps, and I'd be happy to assist in writing another workflow like I've done with the GenBank ingest.
Albert
Hello, could someone fix the link describing which data files are required from GISAID for covidcg to work? There is a link in the data requirements (https://github.com/vector-engineering/covidcg#Data-Requirements) called "data files" in the analysis pipeline section, but it doesn't seem to work (i.e., it doesn't tell you what input files you need). Thanks, Phil