Closed imilenkovic closed 3 years ago
Hello,
Yes we do have a mouse reference fore RiboFlow. They are in a separate repository:
https://github.com/ribosomeprofiling/references_for_riboflow
For mouse data, the reference files needed by RiboFlow are here: https://github.com/ribosomeprofiling/references_for_riboflow/tree/master/transcriptome/mouse/v1/riboflow_annot_and_ref
If you want to generate your own reference, you can check out our scripts and python notebooks in the same repository: https://github.com/ribosomeprofiling/references_for_riboflow/tree/master/transcriptome/mouse/v1/scripts
I hope this helps. Best wishes.
Hello,
Thank you very much for the quick answer!
I replaced all the human transcriptome/annotation files for mouse, and I ran the pipeline by encountered the following error:
[35/d57f83] NOTE: Missing output file(s) KO.3.filter.bam
expected by process filter (1)
-- Execution is retried (1)
[b8/43ea66] NOTE: Missing output file(s) KO.2.filter.bam
expected by process filter (2)
-- Execution is retried (1)
[4b/82e92a] NOTE: Missing output file(s) WT.1.filter.bam
expected by process filter (3)
-- Execution is retried (1)
[f9/4a8eb3] NOTE: Missing output file(s) WT.2.filter.bam
expected by process filter (4)
-- Execution is retried (1)
[54/48fbf5] NOTE: Missing output file(s) WT.3.filter.bam
expected by process filter (5)
-- Execution is retried (1)
In the project.yaml instead of GSM1606107 and GSM1606108 I put WT and KO and gave paths to the 3 WT and 3 KO replicates (fastq files).
Do you maybe have an idea what went wrong?
Thank you!
Did you also update the mouse reference entries in the project.yaml file? If not, bowtie2 won't be able to find reference files for the filtering.
More explicitly, the following entries need to be changed so that riboflow uses the mouse reference. You need to copy the mouse reference files from our github repo and provide their paths.
inpput:
reference:
filter: ./rf_sample_data/filter/human_rtRNA*
transcriptome: ./rf_sample_data/transcriptome/appris_human_24_01_2019_selected*
regions: ./rf_sample_data/annotation/appris_human_24_01_2019_actual_regions.bed
transcript_lengths: ./rf_sample_data/annotation/appris_human_24_01_2019_selected.lengths.tsv
I copied all the mouse reference/annotation files to respective folders and changed the entries:
filter: ./rf_sample_data/filter/mouse_rtRNA*
# transcriptome indicates bowtie2 index files
# Generated from isoform sequences.
transcriptome: ./rf_sample_data/transcriptome/appris_mouse_v1_selected*
# Main annotation file.
# CDS and UTR regions are defined in this file.
regions: ./rf_sample_data/annotation/appris_mouse_v1_actual_regions.bed
# Transcript lengths
transcript_lengths: ./rf_sample_data/annotation/appris_mouse_v1_transcript_lengths.tsv
Should I maybe provide the full path instead of "./" since I am running the pipeline with nextflow?
Now it gave:
[a0/d12e25] NOTE: Missing output file(s) KO.1.transcriptome_alignment.bam
expected by process transcriptome_alignment (6)
-- Execution is retried (1)
Yes, using the absolute ( full ) path is a safer option.
Also, going to the "work" folder and reading the log files (of the filter step) will be helpful. Those are typically hidden files whose file names start with a ".".
This is a good sign! This means the filtering step worked and it encountered and error in the transcriptome alignment step.
Going to the work folder work/a0
and checking out the log files therein might also be helpful.
I would assume that it's failing to generate the .bam files for some reason?
Yes, it seems like so.
In
samtools index -@ '{task.cpus}' KO.1.transcriptome_alignment.bam
The systems should have replaced '{task.cpus}' with the exact number. I wonder why this didn't happen.
Did you manage to run the "test run", that was provided in the documentation? Did it go successfully?
myjob.log I thought it did since I looked at the execution status, but now I see that it had similar errors; however it kept going. I wasn't sure if it was normal for everything to be done in only 12 minutes, but I thought since it was test data that it actually worked.
I would check if the ribo file is generated properly. The sample data should run without any errors and I see that this was run via singularity. So I am not sure if singularity integration was seamless.
Dear all, I checked that in the code in some processes there is {task.cpus} instead of ${task.cpus}. See here as an example: https://github.com/ribosomeprofiling/riboflow/blob/df3e891b1f9f04259ea071da4ff3198d5ae6b242/RiboFlow.groovy#L260
I made a pull request to fix this (https://github.com/ribosomeprofiling/riboflow/pull/9) and I also added a new configuration for running the pipeline using singularity. I hope this can help.
Best,
Hello, With the correction Luca made there are no more errors and the ribo files are produced; however, the whole run took only 6 minutes (and the input files are 6 illumina fastq files with ~30M reads each). Is this normal or is something still off? RiboSeq_Mirko4.log
Hello Ivan, this because is doing caching. Can you do it from scratch?
Hello,
I managed to run the pipeline successfully; however, what I have noticed is that the number of "after_dedup" reads and the reads I see in the .ribo objects when I load them with ribor are quite different:
after_dedup, 492314, 349768 ribor total.reads: 2759, 2239
Do you maybe have an idea where all the missing reads go, and why I end up with such a low number of reads? I attached the stats file from the output folder. Thanks! stats.txt
Hello,
Could you also share your parameters file ( the yaml file you provided input fastqs, mouse reference etc )?
Here - added the .txt so that I can upload it here.
This low number might be due to the read length range. I suggest changing the min read length to 15 and max read length to 40. Could you rerun the pipeline after this change and check the ribo file again? Also, looking at the read length distribution will be very helpful. If there are many ribosome footprints of length ,say greater than 33, they won't make it to the ribo file in the current settings. By modifying the parameters, you can include them.
More explicitly, in the following section of the project.yaml file, I recommend setting min to 15 and max to 40
# old version
# RiboPy parameters for ribo file generation.
ribo:
ref_name: "appris-v1"
metagene_radius: 50
left_span: 35
right_span: 10
read_length:
min: 28
max: 32
coverage: true
So after the recommended changes, it should be as follows:
# new version
# RiboPy parameters for ribo file generation.
ribo:
ref_name: "appris-mouse-v1"
metagene_radius: 50
left_span: 35
right_span: 10
read_length:
min: 15
max: 40
coverage: true
I think this did the trick, thanks a lot!
I just saw that you also changed the ref_name - is this important or it's just the way it's named? I assume it still reading the mouse files I specified elsewhere.
No, reference name is not important. I just changed it so that mouse ribo files can easily be distinguished from human ribo files.
I am happy to hear that changing the read length solved the problem.
Thank you for providing feedback! It helps us a lot!
Hello,
I would like to run the pipeline on some mouse Ribo-seq data, and I was wondering if you might have a mouse annotation file, such as this one for human: appris_human_24_01_2019_actual_regions.bed ? Alternatively, could you point me to how I could generate such a file?
Thanks a lot!
Best, Ivan