Add new dataset Thermotolerans

acope3 commented 1 year ago

Thanks for starting to add a new dataset to example-datasets! This issue template includes the key steps, see add-new-dataset.md. Please edit as needed for your dataset.

[x] Create a new branch of example-datasets with a helpful concise name, for example cheng-entamoeba-123 if the dataset were generated by Dr. Cheng, from entamoeba, and the new issue ticket is number 123.
[x] Identify paper or data source - list and link
[x] Identify the species and strain used, check if example-datasets already has appropriate annotation and contaminant files.
[x] (if new species) Find annotation data for the species and strain elsewhere.
[x] (if new species and genus) Create a genus folder in example-datasets.
[x] (if new species) Download or create contaminants fasta file.
[x] (if new species) Download or create transcriptome annotation fasta and gff files.
[ ] (if new species) Check annotation files for consistency with check_fasta_gff.
[x] Identify the ribosome profiling samples from the dataset (some may be RNA-seq) - link database.
[x] Identify adapter sequence - provide sequence.
[x] Confirm or deny presence of UMIs and barcodes if used - describe if present.
[x] If UMIs are present, create UMI regular expression.
[x] Using information gathered, create config file.
[x] Download sample data.
[x] (optional) Create downsampled data and fast test run on that.
[x] Test run of full sized dataset.
[x] Look at results - check for 3nt periodicity in coding regions, most common read lengths being 28-32 nt, and clear start and stop profiles.
[ ] Troubleshoot as necessary and discuss on issue ticket.
[ ] Update genus-level README.md and provenance section of config file.
[ ] Put in pull request to add to repository.
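
As a rough illustration of the "Look at results" step above, the two headline checks (3nt periodicity and dominant read length) could be spot-checked along these lines. This is only a sketch on toy numbers, not riboviz's own analysis code; the function names and the input format (a plain list of read positions relative to the start codon, and a plain list of read lengths) are assumptions made for the example.

```python
from collections import Counter


def frame_fractions(positions):
    """Fraction of reads in each frame (0, 1, 2).

    `positions` are read positions relative to the first nucleotide of the
    start codon (hypothetical input format for this sketch).
    """
    counts = Counter(pos % 3 for pos in positions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no positions supplied")
    return {frame: counts.get(frame, 0) / total for frame in range(3)}


def most_common_length(lengths):
    """Most frequent read length in `lengths`."""
    return Counter(lengths).most_common(1)[0][0]


# Toy values for illustration only (not taken from any dataset).
print(frame_fractions([0, 3, 6, 1]))
print(most_common_length([28, 29, 28, 30]))
```

A strongly periodic sample would put most reads in a single frame, and the most common length would be expected to fall in the 28-32 nt range mentioned in the checklist.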

acope3 commented 1 year ago

I have added the annotation files. @davbunn1 @HannahMaroof, I see that you both were assigned to the issue for creating the L. kluyveri dataset, so I have assigned you to this issue, as well. Branch is cope-thermotolerans-120.

davbunn1 commented 1 year ago

Working in branch: "cope-thermotolerans-120"
Genus folder within example-datasets: "thermotolerans"
Strain: Lachancea thermotolerans Y-8284 (existing L. thermotolerans contaminants data on example-datasets from L. kluyveri)
Data source: EIRNA BIO in 2023 (James Keane, Darren Fenton and others)
Transcriptome annotation courtesy of Alex Cope @acope3

davbunn1 commented 1 year ago

Contaminants fasta file created from NCBI data, as detailed in the provenance file.

UMIs (N) and barcodes (B) used:

Read structure: NNNN - rpf sequence - NNNNN - BBBBB - Adapter
Barcodes: Rep1 - ATCGT, Rep2 - AGCTA, Rep3 - CGTAA
Adapter sequence: AGATCGGAAGAGCACACGTCTGAA
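
To make the read structure above concrete, here is a minimal sketch of how an adapter-trimmed read with that layout could be split using a regular expression with named groups. It is not the `umi_regexp` from the config file (that value is not shown in this thread). The group names `rpf` and `barcode`, the helper `split_read`, and the toy read are all made up for illustration; tools that consume UMI patterns may require their own group-naming convention.

```python
import re

# Hypothetical pattern for an adapter-trimmed read laid out as
# 4 nt UMI + RPF + 5 nt UMI + 5 nt barcode (group names are illustrative).
READ_PATTERN = re.compile(
    r"^(?P<umi_1>.{4})(?P<rpf>.+)(?P<umi_2>.{5})(?P<barcode>.{5})$"
)

# Barcode-to-replicate mapping as listed in this comment.
BARCODES = {"ATCGT": "Rep1", "AGCTA": "Rep2", "CGTAA": "Rep3"}


def split_read(trimmed_read):
    """Return (sample, umi, rpf) for a trimmed read, or None if it does not fit."""
    match = READ_PATTERN.match(trimmed_read)
    if match is None:
        return None
    sample = BARCODES.get(match.group("barcode"))  # None for an unknown barcode
    umi = match.group("umi_1") + match.group("umi_2")
    return sample, umi, match.group("rpf")


# Toy read built for illustration only (not real data).
toy_read = "ACGT" + "TTGACCGATTGGCATCAAGGCTTAGCAA" + "GGCAT" + "AGCTA"
print(split_read(toy_read))
```

Because the pattern is anchored at both ends, the last 10 nt are always taken as the second UMI and the barcode, and everything between the first 4 nt and those 10 nt is treated as the footprint.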

config.yaml file created (EIRNA_2023_LT_3-samples_cds_250nt_utr_config.yaml) and successfully run on full-sized dataset.

check_fasta_gff is still pending: I am currently getting the error "No module named 'pyfaidx'". This persists even after running 'source activate riboviz', which I had thought would load the necessary modules...
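
One way to narrow down an import error like this is to ask the running interpreter directly whether it can see the module, and which Python executable is actually in use. The snippet below is a generic standard-library check, not part of riboviz; the helper name `module_available` is made up for this sketch.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the top-level module `name` can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


# Shows which Python is actually running; if this is not the interpreter
# inside the expected environment, the activation did not take effect.
print("Interpreter:", sys.executable)
print("pyfaidx importable:", module_available("pyfaidx"))
```

If the interpreter path is outside the activated environment, the environment is probably not the one being used to run the script; if the path is right but the module is not found, the package is likely just not installed in that environment.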

davbunn1 commented 1 year ago

Results from the full-sized test run look excellent. Just waiting on check_fasta_gff before putting in a pull request.
