Open ewallace opened 4 years ago
We expect this to work within a couple of hours; if not, we suspect a bug.
Set up branch Iserman-heatshockDed1-15
and pushed a (currently incomplete) .yaml config file for the dataset.
TODO:
I currently have the Brar 2012 data set up to run on Eddie, so (because both are Saccharomyces datasets and the Iserman branch is branched from master, which holds Brar) I think I could, in theory, work on the `Iserman-heatshockDed1-15` branch safely.
This wouldn't be possible if I wanted to work on a non-Saccharomyces branch (as I found with Cryptococcus). If a job submission script (e.g. for Brar) started running and looked for example-datasets files such as the Saccharomyces Brar yaml config, or .fa / .gff3 files, those files wouldn't exist while the non-merged CryptococcusH99-3 branch was checked out. The process would look for files which didn't exist, and it would fail.
... unless I try this advice from Nick Brown on the Programming Skills course:
"it might be that your job clones a repository ( a completely fresh version of it) somewhere, then changes to a branch and executes that before deleting it... that would be safer and cleaner"
I haven't tested that advice yet, but just wanted to flag up that git branching on Eddie can potentially cause issues for workflow!
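As a rough illustration of Nick Brown's suggestion, here is a minimal sketch in Python of a job step that clones a fresh copy of a repository at a given branch, runs a command inside it, and deletes the clone afterwards. The function name `run_in_fresh_clone` and its parameters are hypothetical, and the sketch has not been tested on Eddie.

```python
import shutil
import subprocess
import tempfile


def run_in_fresh_clone(repo_url, branch, command, run=subprocess.run):
    """Clone `branch` of `repo_url` into a temporary directory, run
    `command` there, then delete the clone.

    `run` defaults to subprocess.run; it is a parameter only so the
    function can be exercised without network access.
    """
    workdir = tempfile.mkdtemp(prefix="fresh-clone-")
    try:
        # Fetch only the requested branch into the empty temporary directory.
        run(
            ["git", "clone", "--branch", branch, "--single-branch",
             repo_url, workdir],
            check=True,
        )
        # Run the job command from inside the fresh clone.
        return run(command, cwd=workdir, check=True)
    finally:
        # Remove the clone whether or not the command succeeded.
        shutil.rmtree(workdir, ignore_errors=True)
```

Because each job works in its own clone, switching branches in the main working copy would no longer affect a job that has already been submitted.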
This is something I can work on while Eddie issue riboviz/#220 is still blocking runs on Eddie. I can also put this together and try running it locally, since I can still run things there.
It's been quite a while since I looked at this, and I don't think the issue described above is relevant now, because I have a better understanding of branching and working with branches of the example-datasets repository while running data on Eddie.
A lot of development has happened since this was last worked on, so the yaml files on develop probably contain quite a few parameters that should be double-checked and potentially added to the incomplete yaml here.
So @ewallace has identified this as a good dataset for @Maberuiz and @B209678-2021 to work on.
It's quite out of date by now as riboviz has developed quite a bit since I started putting this dataset together.
Relevant example-datasets branch: Iserman-heatshockDed1-15
Riboviz Docs on how to upgrade a yaml file: https://github.com/riboviz/riboviz/blob/develop/docs/user/upgrade-config.md
check_fasta_gff
We are trying to upgrade the yaml file using https://github.com/riboviz/riboviz/blob/develop/docs/user/upgrade-config.md but we are having some issues.
We ran: python -m riboviz.tools.upgrade_config_file -i ~/riboviz/example-datasets/fungi/saccharomyces/Iserman_2020_heatshock_RPF_12-samples_CDS_w_250utrs_config.yaml -o ~/riboviz/example-datasets/fungi/saccharomyces/Iserman_2020_heatshock_RPF_12-samples_CDS_w_250utrs_config_2-1.yaml
The error we get is:
Traceback (most recent call last):
File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
"main", fname, loader, pkg_name)
File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/exports/eddie3_homes_local/s2249175/riboviz/riboviz/riboviz/tools/upgrade_config_file.py", line 60, in
However, we have noticed that the file being run (riboviz/riboviz/riboviz/tools/upgrade_config_file.py) is not the one linked at the end of the manual page, which is riboviz/riboviz/riboviz/upgrade_config.py.
Do you have any idea about this @FlicAnderson?
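One thing that might be worth checking: the traceback paths point at /usr/lib64/python2.7, which suggests the `python` command resolved to the system Python 2.7. Assuming riboviz expects a Python 3 environment (an assumption here, not something confirmed in this thread), a minimal check of the active interpreter could look like this. The helper name `running_python3` is hypothetical.

```python
import sys


def running_python3(version_info=None):
    """Return True if the given (or current) interpreter is Python 3+."""
    if version_info is None:
        version_info = sys.version_info
    return version_info[0] >= 3


if __name__ == "__main__":
    # Show which interpreter `python` actually resolves to.
    print(sys.executable)
    print(sys.version)
    if not running_python3():
        print("This is not Python 3; the expected environment may not be active.")
```

If this reports Python 2.7, activating the environment used for riboviz before re-running the upgrade command may be the first thing to try.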
We are unsure what adapter sequence to specify in the yaml file. The data processing steps on the GEO database specifies
However, the supplementary information in the paper states "Libraries were generated according to protocol described in detail before21, using 3’-adapters (5’ (5rApp)NNNNCTGTAGGCACCATCAAT(3ddC) 3’) that were randomized at the first 4 positions of the 5’ end to minimize potential ligation biases"
Do you know what would be the correct one to use?
We found more information on the linker oligonucleotides in https://www.sciencedirect.com/science/article/pii/S1046202316303292.
The first sequence in Table 8 matches that specified in the Iserman et al., supplementary information.
Check out the 'components' section here: https://github.com/riboviz/example-datasets/blob/main/add-new-dataset.md#adding-a-dataset-for-an-existing-species although there's not much info.
If you're stuck, try the steps listed here: https://github.com/riboviz/example-datasets/issues/48#issuecomment-779873481
And if all else fails, @ewallace is an expert at figuring this out, and might be able to help point you in the right direction.
Ran fastqc on the fastq files and found some overrepresented sequences.
The sequences are all the same except that they differ at the first 3 and last 3 nucleotides. We are going to try to run GAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGT as the adapter sequence with ^(?P
The fastqc report also revealed a potentially high level of duplicated sequences. I'm not sure whether this is normal for ribosome profiling or whether we need to account for it. Would the percentage of sequences left after deduplication (63.76%) be too low?
Could you provide some guidance, @ewallace @FlicAnderson?
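To make the deduplication figure concrete, here is a minimal sketch that computes the percentage of reads that would remain if exact duplicate sequences were collapsed. It assumes a plain four-line-per-record FASTQ layout, and the function name is hypothetical. FastQC's own figure is an estimate, so the two numbers may not agree exactly.

```python
def percent_remaining_after_dedup(fastq_lines):
    """Percentage of reads left if identical sequences were collapsed.

    `fastq_lines` is any iterable of lines in four-line FASTQ records;
    the sequence is the second line of each record.
    """
    sequences = [
        line.strip()
        for index, line in enumerate(fastq_lines)
        if index % 4 == 1
    ]
    if not sequences:
        return 0.0
    return 100.0 * len(set(sequences)) / len(sequences)
```

For an uncompressed file this could be called as `percent_remaining_after_dedup(open(path))`, where `path` is the FASTQ file of interest.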
Referencing this https://github.com/riboviz/example-datasets/issues/15#issuecomment-1130009988
I've had a look at the fastq file and it seems that they are using the first barcode (after looking at several sequences, before the adapter sequence I always found the barcode ATCGT and in a few cases a sequence that doesn't match any of the barcodes). Anyway, I will next try to run the dataset including those 5 bases in the umi rather than in the adapter sequence. Therefore, the umi regexp I'll use is: ^(?P<umi_1>.{2}).+(?P<umi_2>.{10})$
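To see what this regexp captures, here is a small sketch using Python's built-in `re` module. The read in the usage note is made up for illustration, and joining `umi_1` and `umi_2` into one UMI is an assumption about how the extraction step treats the named groups.

```python
import re

# UMI regexp from the comment above: 2 nt at the 5' end, 10 nt at the 3' end.
UMI_REGEXP = r"^(?P<umi_1>.{2}).+(?P<umi_2>.{10})$"


def split_umi(read):
    """Return (umi, insert) for a read, or None if the read is too short."""
    match = re.match(UMI_REGEXP, read)
    if match is None:
        return None
    # Assumption: the two UMI groups are concatenated in order.
    umi = match.group("umi_1") + match.group("umi_2")
    # Everything between the two UMI groups is kept as the insert.
    insert = read[match.end("umi_1"):match.start("umi_2")]
    return umi, insert
```

For a made-up read such as `"AC" + "GGGTTTCCC" + "TTTTTATCGT"`, the first 2 and last 10 bases go to the UMI and `GGGTTTCCC` is kept as the insert. Reads shorter than 13 nt do not match at all.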
After running the dataset with the new regexp, the results are the same. Therefore, they must be using only the first barcode as observed in the sequences I had a look at and the problem must be somewhere else.
I looked at the read_counts_per_file.tsv file. For sample WT_30_1, there were:
For sample WT_30_2, there were:
For WT_30_1, that means 40.36% of reads aligned to ORFs and 55.35% aligned with contaminants. For WT_30_2, that means 21.25% of reads aligned to ORFs and 36.6% aligned with contaminants.
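The percentages above are simple ratios of read counts. A minimal sketch of the arithmetic follows; the helper name is hypothetical, and which step's count is used as the total is an assumption the caller has to make. The counts in the usage note are illustrative only, not values from this dataset.

```python
def percent_of_reads(count, total):
    """Return `count` as a percentage of `total`, rounded to 2 decimals."""
    if total <= 0:
        raise ValueError("total must be a positive number of reads")
    return round(100.0 * count / total, 2)
```

For example, `percent_of_reads(4036, 10000)` gives 40.36.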
Suggested yeast dataset of interest to me: Ribosome profiling data GEO: GSE131176.
From Condensation of Ded1p Promotes a Translational Switch from Housekeeping to Stress Protein Production, https://doi.org/10.1016/j.cell.2020.04.009.