riboviz / example-datasets

Example datasets to run with RiboViz
Apache License 2.0

Iserman et al heat shock and Ded1 yeast dataset? #15

Open ewallace opened 4 years ago

ewallace commented 4 years ago

Suggested yeast dataset of interest to me: Ribosome profiling data GEO: GSE131176.

From "Condensation of Ded1p Promotes a Translational Switch from Housekeeping to Stress Protein Production", https://doi.org/10.1016/j.cell.2020.04.009.

kavousan commented 4 years ago

We expect this to work within a couple of hours; if not, we suspect a bug.

FlicAnderson commented 4 years ago

Set up branch Iserman-heatshockDed1-15 and pushed a (currently incomplete) .yaml config file for the dataset.

TODO:

FlicAnderson commented 4 years ago

I currently have the Brar 2012 dataset set up to run on Eddie, so (because they're both Saccharomyces datasets and the Iserman branch is branched from master, which holds Brar) I think I could theoretically work on the 'Iserman-heatshockDed1-15' branch safely.

This wouldn't be possible if I wanted to work on a NON-Saccharomyces branch (e.g. as I found with Cryptococcus) because if a job submission script started running (e.g. Brar) and looked for example-datasets files like the Saccharomyces Brar yaml config, or .fa / .gff3 files, those wouldn't exist if I was running in the non-merged CryptococcusH99-3 branch. The process would look for files which didn't exist, and the process would fail.

... unless I try this advice from Nick Brown on the Programming Skills course:

"it might be that your job clones a repository ( a completely fresh version of it) somewhere, then changes to a branch and executes that before deleting it... that would be safer and cleaner"

I haven't tested that advice yet, but just wanted to flag up that git branching on Eddie can potentially cause issues for workflow!
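
For reference, the "fresh clone per job" idea quoted above could look roughly like the sketch below. It is untested on Eddie. The helper names (`fresh_clone_commands`, `run_in_fresh_clone`) and the `job` callback are made up for illustration; only the `git clone --branch ... --single-branch` invocation is standard git.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def fresh_clone_commands(repo_url, branch, dest):
    """Build the git command for a fresh, single-branch clone into dest."""
    return ["git", "clone", "--branch", branch, "--single-branch",
            repo_url, str(dest)]


def run_in_fresh_clone(repo_url, branch, job):
    """Clone repo_url at branch into a temp dir, call job(path), then delete it.

    `job` is a hypothetical callable that receives the clone directory,
    e.g. a function that launches the workflow against files in that clone.
    """
    workdir = Path(tempfile.mkdtemp(prefix="example-datasets-"))
    clone_dir = workdir / "clone"
    try:
        subprocess.run(fresh_clone_commands(repo_url, branch, clone_dir),
                       check=True)
        return job(clone_dir)
    finally:
        # Remove the clone whether or not the job succeeded.
        shutil.rmtree(workdir, ignore_errors=True)
```

Because each job gets its own clone, switching branches in the main working copy would no longer affect a job that has already been submitted.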

FlicAnderson commented 4 years ago

This is something I can work on while Eddie issue riboviz/#220 is still a blocker to actually running things on Eddie. I can also put this together locally and perhaps try running it there, since I can still run things locally.

FlicAnderson commented 3 years ago

It's been quite a while since I looked at this, and I don't think the issue described above is relevant now, because I have a better understanding of branching and working with branches of the example-datasets repository while running data on Eddie.

A lot of time and development has passed since this was last worked on, so there are probably quite a few parameters in the yaml files on develop which should be double-checked and potentially added to the incomplete yaml for this dataset.

FlicAnderson commented 2 years ago

@ewallace has identified that this would be a good dataset for @Maberuiz and @B209678-2021 to work on.

It's quite out of date by now as riboviz has developed quite a bit since I started putting this dataset together.

Relevant example-datasets branch: Iserman-heatshockDed1-15

Riboviz Docs on how to upgrade a yaml file: https://github.com/riboviz/riboviz/blob/develop/docs/user/upgrade-config.md

Eldonoho99 commented 2 years ago

We are trying to upgrade the yaml file using https://github.com/riboviz/riboviz/blob/develop/docs/user/upgrade-config.md but we are having some issues.

We ran: python -m riboviz.tools.upgrade_config_file -i ~/riboviz/example-datasets/fungi/saccharomyces/Iserman_2020_heatshock_RPF_12-samples_CDS_w_250utrs_config.yaml -o ~/riboviz/example-datasets/fungi/saccharomyces/Iserman_2020_heatshock_RPF_12-samples_CDS_w_250utrs_config_2-1.yaml

The error we get is:

Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/exports/eddie3_homes_local/s2249175/riboviz/riboviz/riboviz/tools/upgrade_config_file.py", line 60, in <module>
    invoke_upgrade_config_file()
  File "/exports/eddie3_homes_local/s2249175/riboviz/riboviz/riboviz/tools/upgrade_config_file.py", line 56, in invoke_upgrade_config_file
    upgrade_config.upgrade_config_file(input_file, output_file)
  File "riboviz/upgrade_config.py", line 149, in upgrade_config_file
    yaml.dump(config, f, default_flow_style=False, sort_keys=False)
  File "/usr/lib64/python2.7/site-packages/yaml/__init__.py", line 202, in dump
    return dump_all([data], stream, Dumper=Dumper, **kwds)
TypeError: dump_all() got an unexpected keyword argument 'sort_keys'

However, we have noticed that the file being run (riboviz/riboviz/riboviz/tools/upgrade_config_file.py) is not the one linked at the end of the manual page, which is riboviz/riboviz/riboviz/upgrade_config.py.

Do you have any idea about this @FlicAnderson?
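
One thing that may be worth checking: the paths in the traceback point at /usr/lib64/python2.7, and the `sort_keys` argument to `yaml.dump` was only added in PyYAML 5.1. So a possible explanation (a guess, not confirmed in this thread) is that `python` resolved to the system Python 2.7 with an old PyYAML rather than the riboviz environment. The sketch below reports which interpreter and PyYAML version are in use. The helper names are made up for illustration.

```python
import sys
from itertools import takewhile


def parse_version(text):
    """Turn a version string such as '5.3.1' or '5.1b1' into a tuple of ints."""
    parts = []
    for piece in text.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def supports_sort_keys(pyyaml_version):
    """True if this PyYAML version accepts yaml.dump(..., sort_keys=...)."""
    return parse_version(pyyaml_version) >= (5, 1)


def describe_environment():
    """Report the running Python version and, if importable, the PyYAML version."""
    lines = ["python: %d.%d" % sys.version_info[:2]]
    try:
        import yaml
    except ImportError:
        lines.append("PyYAML: not installed")
    else:
        version = getattr(yaml, "__version__", "0")
        lines.append("PyYAML: %s (sort_keys supported: %s)"
                     % (version, supports_sort_keys(version)))
    return lines


if __name__ == "__main__":
    print("\n".join(describe_environment()))
```

If this reports Python 2.7, or a PyYAML older than 5.1, then activating the riboviz environment before rerunning the upgrade command would be the first thing to try.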

Eldonoho99 commented 2 years ago

We are unsure what adapter sequence to specify in the yaml file. The data processing steps on the GEO database specifies

However, the supplementary information in the paper states "Libraries were generated according to protocol described in detail before [21], using 3’-adapters (5’ (5rApp)NNNNCTGTAGGCACCATCAAT(3ddC) 3’) that were randomized at the first 4 positions of the 5’ end to minimize potential ligation biases"

Do you know what would be the correct one to use?

Eldonoho99 commented 2 years ago

We found more information on the linker oligonucleotides in https://www.sciencedirect.com/science/article/pii/S1046202316303292.

The first sequence in Table 8 matches that specified in the Iserman et al., supplementary information.

FlicAnderson commented 2 years ago

> We are unsure what adapter sequence to specify in the yaml file. The data processing steps on the GEO database specifies
>
> However, the supplementary information in the paper states "Libraries were generated according to protocol described in detail before [21], using 3’-adapters (5’ (5rApp)NNNNCTGTAGGCACCATCAAT(3ddC) 3’) that were randomized at the first 4 positions of the 5’ end to minimize potential ligation biases"
>
> Do you know what would be the correct one to use?

Check out the 'components' section here: https://github.com/riboviz/example-datasets/blob/main/add-new-dataset.md#adding-a-dataset-for-an-existing-species although there's not much info.

If you're stuck, try the steps listed here: https://github.com/riboviz/example-datasets/issues/48#issuecomment-779873481

And if all else fails, @ewallace is an expert at figuring this out, and might be able to help point you in the right direction.

Eldonoho99 commented 2 years ago

We ran FastQC on the fastq files and found some overrepresented sequences.

The sequences are all the same except at the first 3 and last 3 nucleotides. We are going to try running with GAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGT as the adapter sequence and ^(?P.{3}).+(?P.{3}) as the regular expression.
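
To illustrate what a regular expression of this shape does, here is a small Python sketch that pulls 3 nt off each end of a read. The group names `umi_1` and `umi_2` and the trailing `$` are assumptions for the example; the toy read is invented and not taken from the dataset.

```python
import re

# Assumed group names (umi_1, umi_2) and an assumed end anchor ($).
UMI_PATTERN = re.compile(r"^(?P<umi_1>.{3}).+(?P<umi_2>.{3})$")


def split_umis(read):
    """Return (umi, insert) for a read, or None if the read is too short."""
    match = UMI_PATTERN.match(read)
    if match is None:
        return None
    # Concatenate the two flanking pieces into one UMI string.
    umi = match.group("umi_1") + match.group("umi_2")
    # Everything between the two named groups is the remaining insert.
    insert = read[match.end("umi_1"):match.start("umi_2")]
    return umi, insert


# Toy read: 3 nt + 8 nt + 3 nt.
example = split_umis("AAACCCCGGGGTTT")
```

A read needs at least 7 nt to match, because `.+` requires one or more bases between the two 3-nt groups.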

Eldonoho99 commented 2 years ago

FastQC also revealed a potentially high level of duplicated sequences. I'm not sure whether this is normal for ribosome profiling or whether we need to account for it. Would the percentage of sequences remaining after deduplication (63.76%) be too low?

Could you provide some guidance, @ewallace @FlicAnderson?
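
For clarity about what that percentage means, a minimal sketch: it is the number of distinct sequences divided by the total number of sequences. This is only the naive exact calculation on a toy list. FastQC's own figure is an estimate, so the two need not agree exactly.

```python
def percent_remaining_after_dedup(sequences):
    """Percentage of sequences left if exact duplicates were collapsed."""
    sequences = list(sequences)
    if not sequences:
        raise ValueError("no sequences given")
    return 100.0 * len(set(sequences)) / len(sequences)


# Toy input: 4 reads, 3 distinct sequences.
toy_percent = percent_remaining_after_dedup(["ACGT", "ACGT", "TTTT", "GGGG"])
```

Note that this collapses reads by sequence alone. When reads carry UMIs, identical sequences with different UMIs are not necessarily duplicates.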

Maberuiz commented 2 years ago

Referencing this https://github.com/riboviz/example-datasets/issues/15#issuecomment-1130009988

I've had a look at the fastq file and it seems that they are using the first barcode: in the sequences I checked, the barcode ATCGT always came directly before the adapter sequence, apart from a few cases where the sequence didn't match any of the barcodes. I will next try running the dataset with those 5 bases included in the UMI rather than in the adapter sequence. The UMI regexp I'll use is therefore: ^(?P<umi_1>.{2}).+(?P<umi_2>.{10})$
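
The manual check described here (looking at what sits directly before the adapter) could be automated along the lines of the sketch below. The function name, the adapter string and the reads are all hypothetical toy values, not the real adapter or data.

```python
from collections import Counter


def upstream_kmers(reads, adapter, k=5):
    """Count the k bases found directly upstream of the first adapter hit."""
    counts = Counter()
    for read in reads:
        pos = read.find(adapter)
        # Skip reads without the adapter or with fewer than k bases before it.
        if pos >= k:
            counts[read[pos - k:pos]] += 1
    return counts


# Toy reads and a toy adapter, for illustration only.
toy_reads = ["GGGGATCGTAGATCGG", "CCATCGTAGATCGGAA", "TTTTTTT"]
toy_counts = upstream_kmers(toy_reads, adapter="AGATCGG")
```

Run over a sample of real reads, `toy_counts.most_common()` would show whether one barcode dominates or whether several are present.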

Maberuiz commented 2 years ago

After running the dataset with the new regexp, the results are the same. Therefore, they must be using only the first barcode as observed in the sequences I had a look at and the problem must be somewhere else.

Maberuiz commented 2 years ago

I looked at the read_counts_per_file.tsv file, for sample WT_30_1, there were:

For sample WT_30_2, there were:

Maberuiz commented 2 years ago

For WT_30_1, that means 40.36% of reads aligned to ORFs and 55.35% aligned to contaminants. For WT_30_2, that means 21.25% of reads aligned to ORFs and 36.6% aligned to contaminants.
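
For anyone repeating this calculation, a minimal sketch of turning per-step read counts into percentages of a chosen total. The step names and counts are made up for illustration. They are not the values from read_counts_per_file.tsv, and which step is the right denominator is a choice the analyst has to make.

```python
def percent_of_total(counts, total_key):
    """Return each step's read count as a percentage of counts[total_key]."""
    total = counts[total_key]
    if total <= 0:
        raise ValueError("total read count must be positive")
    return {
        step: round(100.0 * n / total, 2)
        for step, n in counts.items()
        if step != total_key
    }


# Made-up counts for illustration only; not taken from the Iserman dataset.
toy_counts = {"trimmed": 200000, "contaminants": 90000, "orfs": 70000}
toy_percentages = percent_of_total(toy_counts, "trimmed")
```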