ribosomeprofiling / riboflow

Pipeline for Ribosome Profiling Data
MIT License
14 stars 9 forks source link

Mouse annotation #15

Closed imilenkovic closed 3 years ago

imilenkovic commented 3 years ago

Hello,

I would like to run the pipeline on some mouse Ribo-seq data, and I was wondering if you might have a mouse annotation file, such as this one for human: appris_human_24_01_2019_actual_regions.bed ? Alternatively, could you point me to how I could generate such a file?

Thanks a lot!

Best, Ivan

hakanozadam commented 3 years ago

Hello,

Yes we do have a mouse reference fore RiboFlow. They are in a separate repository:

https://github.com/ribosomeprofiling/references_for_riboflow

For mouse data, the reference files needed by RiboFlow are here: https://github.com/ribosomeprofiling/references_for_riboflow/tree/master/transcriptome/mouse/v1/riboflow_annot_and_ref

If you want to generate your own reference, you can check out our scripts and python notebooks in the same repository: https://github.com/ribosomeprofiling/references_for_riboflow/tree/master/transcriptome/mouse/v1/scripts

I hope this helps. Best wishes.

imilenkovic commented 3 years ago

Hello,

Thank you very much for the quick answer!

I replaced all the human transcriptome/annotation files for mouse, and I ran the pipeline by encountered the following error:

[35/d57f83] NOTE: Missing output file(s) KO.3.filter.bam expected by process filter (1) -- Execution is retried (1) [b8/43ea66] NOTE: Missing output file(s) KO.2.filter.bam expected by process filter (2) -- Execution is retried (1) [4b/82e92a] NOTE: Missing output file(s) WT.1.filter.bam expected by process filter (3) -- Execution is retried (1) [f9/4a8eb3] NOTE: Missing output file(s) WT.2.filter.bam expected by process filter (4) -- Execution is retried (1) [54/48fbf5] NOTE: Missing output file(s) WT.3.filter.bam expected by process filter (5) -- Execution is retried (1)

In the project.yaml instead of GSM1606107 and GSM1606108 I put WT and KO and gave paths to the 3 WT and 3 KO replicates (fastq files).

Do you maybe have an idea what went wrong?

Thank you!

hakanozadam commented 3 years ago

Did you also update the mouse reference entries in the project.yaml file? If not, bowtie2 won't be able to find reference files for the filtering.

More explicitly, the following entries need to be changed so that riboflow uses the mouse reference. You need to copy the mouse reference files from our github repo and provide their paths.

inpput:
  reference:
      filter: ./rf_sample_data/filter/human_rtRNA*

      transcriptome: ./rf_sample_data/transcriptome/appris_human_24_01_2019_selected*

      regions: ./rf_sample_data/annotation/appris_human_24_01_2019_actual_regions.bed

      transcript_lengths: ./rf_sample_data/annotation/appris_human_24_01_2019_selected.lengths.tsv
imilenkovic commented 3 years ago

I copied all the mouse reference/annotation files to respective folders and changed the entries:

  filter: ./rf_sample_data/filter/mouse_rtRNA*

  # transcriptome indicates bowtie2 index files
  # Generated from isoform sequences.
  transcriptome: ./rf_sample_data/transcriptome/appris_mouse_v1_selected*

  # Main annotation file.
  # CDS and UTR regions are defined in this file.
  regions: ./rf_sample_data/annotation/appris_mouse_v1_actual_regions.bed

  # Transcript lengths
  transcript_lengths: ./rf_sample_data/annotation/appris_mouse_v1_transcript_lengths.tsv
imilenkovic commented 3 years ago

Should I maybe provide the full path instead of "./" since I am running the pipeline with nextflow?

imilenkovic commented 3 years ago

Now it gave:

[a0/d12e25] NOTE: Missing output file(s) KO.1.transcriptome_alignment.bam expected by process transcriptome_alignment (6) -- Execution is retried (1)

hakanozadam commented 3 years ago

Yes, using the absolute ( full ) path is a safer option.

Also, going to the "work" folder and reading the log files (of the filter step) will be helpful. Those are typically hidden files whose file names start with a ".".

hakanozadam commented 3 years ago

This is a good sign! This means the filtering step worked and it encountered and error in the transcriptome alignment step.

Going to the work folder work/a0 and checking out the log files therein might also be helpful.

imilenkovic commented 3 years ago

I would assume that it's failing to generate the .bam files for some reason?

hakanozadam commented 3 years ago

Yes, it seems like so.

In samtools index -@ '{task.cpus}' KO.1.transcriptome_alignment.bam

The systems should have replaced '{task.cpus}' with the exact number. I wonder why this didn't happen.

Did you manage to run the "test run", that was provided in the documentation? Did it go successfully?

imilenkovic commented 3 years ago

myjob.log I thought it did since I looked at the execution status, but now I see that it had similar errors; however it kept going. I wasn't sure if it was normal for everything to be done in only 12 minutes, but I thought since it was test data that it actually worked.

hakanozadam commented 3 years ago

I would check if the ribo file is generated properly. The sample data should run without any errors and I see that this was run via singularity. So I am not sure if singularity integration was seamless.

lucacozzuto commented 3 years ago

Dear all, I checked that in the code in some processes there is {task.cpus} instead of ${task.cpus}. See here as an example: https://github.com/ribosomeprofiling/riboflow/blob/df3e891b1f9f04259ea071da4ff3198d5ae6b242/RiboFlow.groovy#L260

I made a pull request to fix this (https://github.com/ribosomeprofiling/riboflow/pull/9) and I also added a new configuration for running the pipeline using singularity. I hope this can help.

Best,

imilenkovic commented 3 years ago

Hello, With the correction Luca made there are no more errors and the ribo files are produced; however, the whole run took only 6 minutes (and the input files are 6 illumina fastq files with ~30M reads each). Is this normal or is something still off? RiboSeq_Mirko4.log

lucacozzuto commented 3 years ago

Hello Ivan, this because is doing caching. Can you do it from scratch?

imilenkovic commented 3 years ago

Hello,

I managed to run the pipeline successfully; however, what I have noticed is that the number of "after_dedup" reads and the reads I see in the .ribo objects when I load them with ribor are quite different:

after_dedup, 492314, 349768 ribor total.reads: 2759, 2239

Do you maybe have an idea where all the missing reads go, and why I end up with such a low number of reads? I attached the stats file from the output folder. Thanks! stats.txt

hakanozadam commented 3 years ago

Hello,

Could you also share your parameters file ( the yaml file you provided input fastqs, mouse reference etc )?

imilenkovic commented 3 years ago

project.yaml.txt

Here - added the .txt so that I can upload it here.

hakanozadam commented 3 years ago

This low number might be due to the read length range. I suggest changing the min read length to 15 and max read length to 40. Could you rerun the pipeline after this change and check the ribo file again? Also, looking at the read length distribution will be very helpful. If there are many ribosome footprints of length ,say greater than 33, they won't make it to the ribo file in the current settings. By modifying the parameters, you can include them.

More explicitly, in the following section of the project.yaml file, I recommend setting min to 15 and max to 40

# old version
# RiboPy parameters for ribo file generation.
ribo:
    ref_name:        "appris-v1"
    metagene_radius: 50
    left_span:       35
    right_span:      10
    read_length:
       min: 28
       max: 32
    coverage: true

So after the recommended changes, it should be as follows:

# new version
# RiboPy parameters for ribo file generation.
ribo:
    ref_name:        "appris-mouse-v1"
    metagene_radius: 50
    left_span:       35
    right_span:      10
    read_length:
       min: 15
       max: 40
    coverage: true
imilenkovic commented 3 years ago

I think this did the trick, thanks a lot!

imilenkovic commented 3 years ago

I just saw that you also changed the ref_name - is this important or it's just the way it's named? I assume it still reading the mouse files I specified elsewhere.

hakanozadam commented 3 years ago

No, reference name is not important. I just changed it so that mouse ribo files can easily be distinguished from human ribo files.

I am happy to hear that changing the read length solved the problem.

Thank you for providing feedback! It helps us a lot!