Iserman et al heat shock and Ded1 yeast dataset?

ewallace commented 4 years ago

Suggested yeast dataset of interest to me: Ribosome profiling data GEO: GSE131176.

From "Condensation of Ded1p Promotes a Translational Switch from Housekeeping to Stress Protein Production", https://doi.org/10.1016/j.cell.2020.04.009.

kavousan commented 4 years ago

We expect this to work within a couple of hours; if not, we suspect a bug.

FlicAnderson commented 4 years ago

Set up branch Iserman-heatshockDed1-15 and pushed a (currently incomplete) .yaml config file for the dataset.

TODO:

[x] finish adding in the data files
[ ] double-check that the annotation files and contaminant file paths are correct
[ ] add more information about file structure expected
[ ] check for missing parameters (this config is adapted from B-Sc_2012, and there are a number of parameters, e.g. for extracting UMIs, that are likely to be null but should probably be included)
[ ] try downloading the read files and running the dataset (WITHOUT ALL SAMPLES SELECTED!!!)

FlicAnderson commented 4 years ago

I currently have the Brar 2012 dataset set up to run on Eddie, so (because they're both Saccharomyces datasets and the Iserman branch is branched from master, which holds Brar) I think I could theoretically work on the `Iserman-heatshockDed1-15` branch safely.

This wouldn't be possible if I wanted to work on a NON-Saccharomyces branch (e.g. as I found with Cryptococcus) because if a job submission script started running (e.g. Brar) and looked for example-datasets files like the Saccharomyces Brar yaml config, or .fa / .gff3 files, those wouldn't exist if I was running in the non-merged CryptococcusH99-3 branch. The process would look for files which didn't exist, and the process would fail.

... unless I try this advice from Nick Brown on the Programming Skills course:

"it might be that your job clones a repository ( a completely fresh version of it) somewhere, then changes to a branch and executes that before deleting it... that would be safer and cleaner"

I haven't tested that advice yet, but just wanted to flag up that git branching on Eddie can potentially cause issues for workflow!
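The "fresh clone per job" advice could be sketched roughly as follows. This is an untested illustration rather than an Eddie job script: the function names are made up for this example, and `job_command` stands in for whatever the job would actually run inside the clone.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def clone_command(repo_url, branch, target_dir):
    """Build a git command that clones only `branch` into `target_dir`."""
    return ["git", "clone", "--depth", "1", "--branch", branch,
            repo_url, str(target_dir)]


def run_in_fresh_clone(repo_url, branch, job_command):
    """Clone a fresh copy, run `job_command` inside it, then delete the copy."""
    workdir = Path(tempfile.mkdtemp(prefix="fresh-clone-"))
    repo_dir = workdir / "repo"
    try:
        subprocess.run(clone_command(repo_url, branch, repo_dir), check=True)
        subprocess.run(job_command, cwd=repo_dir, check=True)
    finally:
        # Always remove the temporary clone, even if the job failed.
        shutil.rmtree(workdir, ignore_errors=True)


# Hypothetical usage (needs git and network access, so it is left commented out):
# run_in_fresh_clone(
#     "https://github.com/riboviz/example-datasets.git",
#     "Iserman-heatshockDed1-15",
#     ["ls"],  # placeholder for the real job command
# )
```

Because each job works in its own throwaway clone, switching branches in the main working copy while a job is queued should no longer affect which files the job sees.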

FlicAnderson commented 4 years ago

This is something I can work on while Eddie issue riboviz/#220 is still a blocker to actually running things on Eddie. I could also put this together locally and try running it there, since I can still run things locally.

FlicAnderson commented 3 years ago

It's been quite a while since I looked at this, and I don't think the issue described above is relevant now, because I have a better understanding of branching and working with branches of the example-datasets repository while running data on Eddie.

A lot of development has happened since this was last worked on, so there are probably quite a few parameters in the yaml files on develop that should be double-checked and, where needed, added to the incomplete yaml for this dataset.
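One way to do that double-check is to compare the keys of the old config against a current one. The sketch below works on plain dictionaries (for example, the result of loading each yaml file); the parameter names in the toy data are illustrative only and are not taken from the actual configs.

```python
def missing_keys(old_config, reference_config):
    """Keys that the reference config defines but the old config lacks."""
    return sorted(set(reference_config) - set(old_config))


def null_keys(config):
    """Keys that are present but set to None (null in yaml)."""
    return sorted(key for key, value in config.items() if value is None)


# Toy data: hypothetical parameter names, not the real file contents.
old = {"adapters": "CTGTAGGCACC", "dir_in": "input", "umi_regexp": None}
reference = {"adapters": None, "dir_in": None, "umi_regexp": None,
             "extract_umis": False, "dedup_umis": False}

print("Missing from old config:", missing_keys(old, reference))
print("Null in old config:", null_keys(old))
```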

FlicAnderson commented 2 years ago

So @ewallace has identified that this would be a good dataset for @Maberuiz and @B209678-2021 to work on.

It's quite out of date by now as riboviz has developed quite a bit since I started putting this dataset together.

Relevant example-datasets branch: Iserman-heatshockDed1-15

Riboviz Docs on how to upgrade a yaml file: https://github.com/riboviz/riboviz/blob/develop/docs/user/upgrade-config.md

FlicAnderson commented 2 years ago

[x] Create a new branch of example-datasets with a helpful concise name
[x] Identify paper or data source - list and link
[x] Identify the species and strain used, check if example-datasets already has appropriate annotation and contaminant files.
[ ] (if new species) Find annotation data for the species and strain elsewhere.
[ ] (if new species and genus) Create a genus folder in example-datasets.
[ ] (if new species) Download or create contaminants fasta file.
[ ] (if new species) Download or create transcriptome annotation fasta and gff files.
[ ] (if new species) Check annotation files for consistency with check_fasta_gff.
[ ] Identify the ribosome profiling samples from the dataset (some may be RNA-seq) - link database.
[ ] Identify adapter sequence - provide sequence.
[ ] Confirm or deny presence of UMIs and barcodes if used - describe if present.
[ ] If UMIs are present, create UMI regular expression.
[ ] Using information gathered, create config file.
[ ] Download sample data.
[ ] (optional) Create downsampled data and fast test run on that.
[ ] Test run of full sized dataset.
[ ] Look at results - check for 3nt periodicity in coding regions, most common read lengths being 28-32 nt, and clear start and stop profiles.
[ ] Troubleshoot as necessary and discuss on issue ticket.
[ ] Update genus-level README.md and provenance section of config file.
[ ] Put in pull request to add to repository.
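
As a rough illustration of the "look at results" step, the checks on read length and 3nt periodicity could be expressed as below. The counts are invented for the example, and the function names are hypothetical; riboviz's own output files and plots remain the real source for these checks.

```python
def modal_read_length(length_counts):
    """Return the read length with the highest count."""
    return max(length_counts, key=length_counts.get)


def frame_fractions(frame_counts):
    """Fraction of reads assigned to each of the three reading frames."""
    total = sum(frame_counts)
    return [count / total for count in frame_counts]


# Invented numbers, purely for illustration.
toy_lengths = {26: 1200, 27: 3400, 28: 9800, 29: 7600, 30: 4100, 31: 900}
toy_frames = [7200, 1900, 900]  # reads in frame 0, 1, 2

mode = modal_read_length(toy_lengths)
print("Most common read length:", mode, "(expected within 28-32 nt)")
print("Frame fractions:", [round(f, 2) for f in frame_fractions(toy_frames)])
```

A strong excess of reads in one frame is what 3nt periodicity looks like in this summary; roughly equal fractions would suggest a problem.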

Eldonoho99 commented 2 years ago

We are trying to upgrade the yaml file using https://github.com/riboviz/riboviz/blob/develop/docs/user/upgrade-config.md but we are having some issues.

We ran: python -m riboviz.tools.upgrade_config_file -i ~/riboviz/example-datasets/fungi/saccharomyces/Iserman_2020_heatshock_RPF_12-samples_CDS_w_250utrs_config.yaml -o ~/riboviz/example-datasets/fungi/saccharomyces/Iserman_2020_heatshock_RPF_12-samples_CDS_w_250utrs_config_2-1.yaml

The error we get is:

Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/exports/eddie3_homes_local/s2249175/riboviz/riboviz/riboviz/tools/upgrade_config_file.py", line 60, in <module>
    invoke_upgrade_config_file()
  File "/exports/eddie3_homes_local/s2249175/riboviz/riboviz/riboviz/tools/upgrade_config_file.py", line 56, in invoke_upgrade_config_file
    upgrade_config.upgrade_config_file(input_file, output_file)
  File "riboviz/upgrade_config.py", line 149, in upgrade_config_file
    yaml.dump(config, f, default_flow_style=False, sort_keys=False)
  File "/usr/lib64/python2.7/site-packages/yaml/__init__.py", line 202, in dump
    return dump_all([data], stream, Dumper=Dumper, **kwds)
TypeError: dump_all() got an unexpected keyword argument 'sort_keys'

However, we have noticed that the file being run (riboviz/riboviz/riboviz/tools/upgrade_config_file.py) is not the one linked at the end of the manual page, which is riboviz/riboviz/riboviz/upgrade_config.py.

Do you have any idea about this @FlicAnderson?
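
A possible line of investigation: the paths in the traceback point at python2.7, and the `sort_keys` argument to `yaml.dump` was only added in PyYAML 5.1, so the command may have picked up a system Python 2.7 with an older PyYAML rather than the intended environment. The check below is a small sketch of how to confirm that; `diagnose` is a hypothetical helper, not part of riboviz.

```python
import re


def diagnose(python_version, pyyaml_version):
    """Return a list of likely problems for the given interpreter and PyYAML versions.

    python_version: tuple such as (2, 7) or (3, 9)
    pyyaml_version: string such as "3.12" or "5.4.1"
    """
    problems = []
    if tuple(python_version) < (3, 0):
        problems.append("running under Python 2 rather than Python 3")
    match = re.match(r"(\d+)\.(\d+)", pyyaml_version)
    if match is None:
        problems.append("could not parse PyYAML version: " + pyyaml_version)
    elif (int(match.group(1)), int(match.group(2))) < (5, 1):
        problems.append("PyYAML older than 5.1: yaml.dump() does not accept sort_keys")
    return problems


# Versions chosen to mimic the traceback; the PyYAML version here is a guess.
for problem in diagnose((2, 7), "3.12"):
    print(problem)
```

To check a real environment, pass `sys.version_info[:2]` and `yaml.__version__` from the interpreter that actually runs the command.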

Eldonoho99 commented 2 years ago

We are unsure what adapter sequence to specify in the yaml file. The data processing steps on the GEO database specifies

However, the supplementary information in the paper states "Libraries were generated according to protocol described in detail before21, using 3’-adapters (5’ (5rApp)NNNNCTGTAGGCACCATCAAT(3ddC) 3’) that were randomized at the first 4 positions of the 5’ end to minimize potential ligation biases"

Do you know what would be the correct one to use?

Eldonoho99 commented 2 years ago

We found more information on the linker oligonucleotides in https://www.sciencedirect.com/science/article/pii/S1046202316303292.

The first sequence in Table 8 matches that specified in the Iserman et al., supplementary information.

FlicAnderson commented 2 years ago

> We are unsure what adapter sequence to specify in the yaml file. The data processing steps on the GEO database specifies
>
> However, the supplementary information in the paper states "Libraries were generated according to protocol described in detail before21, using 3’-adapters (5’ (5rApp)NNNNCTGTAGGCACCATCAAT(3ddC) 3’) that were randomized at the first 4 positions of the 5’ end to minimize potential ligation biases"
>
> Do you know what would be the correct one to use?

Check out the 'components' section here: https://github.com/riboviz/example-datasets/blob/main/add-new-dataset.md#adding-a-dataset-for-an-existing-species although there's not much info.

If you're stuck, try the steps listed here: https://github.com/riboviz/example-datasets/issues/48#issuecomment-779873481

And if all else fails, @ewallace is an expert at figuring this out, and might be able to help point you in the right direction.

Eldonoho99 commented 2 years ago

Ran fastqc on the fastq files and found some overrepresented sequences.

The sequences are all the same except that they differ at the first 3 and last 3 nucleotides. We are going to try running with GAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGT as the adapter sequence and ^(?P<umi_1>.{3}).+(?P<umi_2>.{3}) as the UMI regular expression.

Eldonoho99 commented 2 years ago

FastQC also revealed a potentially high level of duplicated sequences. I'm not sure whether this is normal for ribosome profiling or whether we need to account for it. Would the percentage of sequences remaining after deduplication (63.76%) be too low?

Could you provide some guidance, @ewallace @FlicAnderson?

Maberuiz commented 2 years ago

Referencing this https://github.com/riboviz/example-datasets/issues/15#issuecomment-1130009988

I've had a look at the fastq file and it seems that they are using the first barcode (after looking at several sequences, before the adapter sequence I always found the barcode ATCGT and in a few cases a sequence that doesn't match any of the barcodes). Anyway, I will next try to run the dataset including those 5 bases in the umi rather than in the adapter sequence. Therefore, the umi regexp I'll use is: ^(?P<umi_1>.{2}).+(?P<umi_2>.{10})$
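
To see what that regexp captures, it can be tried on a toy read with Python's `re` module, which supports the same named-group syntax. The read below is invented, and joining `umi_1` and `umi_2` into one string is an assumption about how the extraction tool treats the groups.

```python
import re

UMI_REGEXP = r"^(?P<umi_1>.{2}).+(?P<umi_2>.{10})$"


def split_read(read, pattern=UMI_REGEXP):
    """Split a read into (combined UMI, remaining insert), or None if too short."""
    match = re.match(pattern, read)
    if match is None:
        return None
    umi = match.group("umi_1") + match.group("umi_2")
    # The insert is whatever lies between the two named groups.
    insert = read[match.end("umi_1"):match.start("umi_2")]
    return umi, insert


# Invented read: 2 nt at the 5' end, a 12 nt insert, 10 nt at the 3' end.
toy_read = "AA" + "GGGGCCCCTTTT" + "ACGTACGTAC"
print(split_read(toy_read))
```

Reads shorter than 13 nt cannot match, because the pattern needs 2 nt, at least 1 nt of insert, and 10 nt.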

Maberuiz commented 2 years ago

After running the dataset with the new regexp, the results are the same. Therefore, they must be using only the first barcode as observed in the sequences I had a look at and the problem must be somewhere else.

Maberuiz commented 2 years ago

I looked at the read_counts_per_file.tsv file, for sample WT_30_1, there were:

29,107,652 initial reads
29,107,165 after removal of adapters
12,991,783 reads that did not align with contaminants
11,748,465 reads aligned to ORFs index files
11,748,463 reads after trimming of 5' mismatches and removal of those with more than 2 mismatches

For sample WT_30_2, there were:

23,874,725 initial reads
23,874,033 after removal of adapters
15,135,171 reads that did not align with contaminants
5,073,406 reads aligned to ORFs index files
5,073,406 reads after trimming of 5' mismatches and removal of those with more than 2 mismatches

Maberuiz commented 2 years ago

As percentages of the initial reads: for WT_30_1, 40.36% of reads aligned to ORFs and 55.36% aligned to contaminants. For WT_30_2, 21.25% of reads aligned to ORFs and 36.60% aligned to contaminants.
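
For reference, these percentages can be recomputed from the read counts listed above. The sketch assumes that contaminant reads are the difference between the adapter-trimmed and non-contaminant counts and that everything is expressed relative to the initial read count; the dictionary keys are names chosen for this example, not columns of read_counts_per_file.tsv.

```python
READ_COUNTS = {
    "WT_30_1": {"initial": 29_107_652, "trimmed": 29_107_165,
                "non_contaminant": 12_991_783, "orf_aligned": 11_748_465},
    "WT_30_2": {"initial": 23_874_725, "trimmed": 23_874_033,
                "non_contaminant": 15_135_171, "orf_aligned": 5_073_406},
}


def summarise(counts):
    """Percentages of initial reads that are ORF-aligned, contaminant, or unaligned."""
    initial = counts["initial"]
    contaminant = counts["trimmed"] - counts["non_contaminant"]
    unaligned = counts["non_contaminant"] - counts["orf_aligned"]
    return {
        "orf_pct": 100 * counts["orf_aligned"] / initial,
        "contaminant_pct": 100 * contaminant / initial,
        "unaligned_pct": 100 * unaligned / initial,
    }


for sample, counts in READ_COUNTS.items():
    summary = summarise(counts)
    print(sample, {name: round(value, 2) for name, value in summary.items()})
```

The `unaligned_pct` value covers reads that passed contaminant filtering but did not align to the ORF index, which is where the two samples differ most.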

riboviz / example-datasets

Iserman et al heat shock and Ded1 yeast dataset? #15