smithlabcode / ribotricer

A tool for accurately detecting actively translating ORFs from Ribo-seq data
http://doi.org/djv4
GNU General Public License v3.0
28 stars 8 forks source link

Unable to run learn-cutoff function #134

Closed bsms9e0n closed 11 months ago

bsms9e0n commented 1 year ago

Description

I have tried running the learn-cutoff function to determine the phase score I should be using.

Ribotricer is being run through Conda and I am using the most up-to-date version (updated 10.01.23)

What I Did

Command ran: ribotricer learn-cutoff --ribo_bams /pathtoriboBAM/Ribosome_Footprint_norna_match.sorted.bam \ --rna_bams /pathtornabam/Ribosome_Footprint_mRNA_norna_match.sorted.bam \ --prefix ribo_rna_prefix \ --ribotricer_index /pathtocandidateORFs/RiboTricer_step1_candidate_orfs.tsv

If there was a crash, please include the traceback here.


Got unexpected extra arguments ( --ribotricer_index /path/RiboTricer_step1_candidate_orfs.tsv  --rna_bams /path/Ribosome_Footprint_mRNA_norna_match.sorted.bam  --prefix ribo_rna_prefix)

I cant find what this unexpected argument is. I have tried adding in lots of other options but it still results in the same error.

Thanks so much for your help!
saketkc commented 1 year ago

I think it is line ending issue. You could write the entire command in one line:


ribotricer learn-cutoff --ribo_bams /pathtoriboBAM/Ribosome_Footprint_norna_match.sorted.bam --rna_bams /pathtornabam/Ribosome_Footprint_mRNA_norna_match.sorted.bam --prefix ribo_rna_prefix --ribotricer_index /pathtocandidateORFs/RiboTricer_step1_candidate_orfs.tsv```
bsms9e0n commented 1 year ago

Thanks for getting back so quickly. It now runs! Although it is now tripping up at this next error - any idea what this is? code output below:

(ribotricer) -bash-4.2$ ribotricer learn-cutoff --ribo_bams /mnt/nfs2/bsms/bsms9e0n/AG298_RiboSeq/Python/libraries/density/filtering_bowtie/alignments/chr/Ribosome_Footprint_norrna_match.sorted.bam --rna_bams /mnt/nfs2/bsms/bsms9e0n/AG298_RiboSeq/Python/libraries/density/filtering_bowtie/alignments/chr/Ribosome_Footprint_mRNA_norrna_match.sorted.bam --prefix ribo_rna_prefix --ribotricer_index /mnt/nfs2/bsms/bsms9e0n/AG298_RiboSeq/RiboTricer/RiboTricer_step1_candidate_orfs.tsv Running ribotricer on 1 Ribo-seq sample .....

Jan 10 12:37:22 ..... started ribotricer detect-orfs Jan 10 12:37:22 ... started parsing ribotricer index file Jan 10 12:37:22 ... started inferring experimental design Jan 10 12:37:59 ... started reading bam file Jan 10 12:39:07 ... started plotting read length distribution Jan 10 12:39:07 ... started calculating metagene profiles. This may take a long time...

Jan 10 12:41:07 ... started plotting metagene profiles Jan 10 12:41:11 ... started inferring P-site offsets Jan 10 12:41:11 ... started shifting according to P-site offsets Jan 10 12:41:13 ... started exporting wig file of alignments after shifting Jan 10 12:41:14 ... started calculating phase scores for each ORF 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 4771/4772 [00:11<00:00, 406.35ORFs/s] Jan 10 12:41:26 ... finished ribotricer detect-orfs Running ribotricer on 1 RNA-seq sample .....

Jan 10 12:41:26 ..... started ribotricer detect-orfs Jan 10 12:41:26 ... started parsing ribotricer index file Jan 10 12:41:26 ... started inferring experimental design Jan 10 12:41:58 ... started reading bam file Jan 10 12:43:06 ... started plotting read length distribution Jan 10 12:43:06 ... started calculating metagene profiles. This may take a long time...

Jan 10 12:44:23 ... started plotting metagene profiles Jan 10 12:44:26 ... started inferring P-site offsets Jan 10 12:44:26 ... started shifting according to P-site offsets Jan 10 12:44:27 ... started exporting wig file of alignments after shifting Jan 10 12:44:28 ... started calculating phase scores for each ORF 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 4771/4772 [00:12<00:00, 373.40ORFs/s] Jan 10 12:44:41 ... finished ribotricer detect-orfs Traceback (most recent call last): File "/its/home/bsms9e0n/.conda/envs/ribotricer/bin/ribotricer", line 10, in sys.exit(cli()) File "/its/home/bsms9e0n/.conda/envs/ribotricer/lib/python3.7/site-packages/click/core.py", line 1130, in call return self.main(args, kwargs) File "/its/home/bsms9e0n/.conda/envs/ribotricer/lib/python3.7/site-packages/click/core.py", line 1055, in main rv = self.invoke(ctx) File "/its/home/bsms9e0n/.conda/envs/ribotricer/lib/python3.7/site-packages/click/core.py", line 1657, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/its/home/bsms9e0n/.conda/envs/ribotricer/lib/python3.7/site-packages/click/core.py", line 1404, in invoke return ctx.invoke(self.callback, ctx.params) File "/its/home/bsms9e0n/.conda/envs/ribotricer/lib/python3.7/site-packages/click/core.py", line 760, in invoke return __callback(args, **kwargs) File "/its/home/bsms9e0n/.conda/envs/ribotricer/lib/python3.7/site-packages/ribotricer/cli.py", line 505, in determine_cutoff_cmd report_all=True, File "/its/home/bsms9e0n/.conda/envs/ribotricer/lib/python3.7/site-packages/ribotricer/learn_cutoff.py", line 244, in determine_cutoff_bam determine_cutoff_tsv(ribo_tsvs, rna_tsvs, filter_by, sampling_ratio, reps) File "/its/home/bsms9e0n/.conda/envs/ribotricer/lib/python3.7/site-packages/ribotricer/learn_cutoff.py", line 92, in determine_cutoff_tsv ribo_phase_median_samples = np.median(ribo_phase_scores_all[ribo_indices], axis=0) IndexError: arrays used as indices must be of integer (or boolean) type (ribotricer) -bash-4.2$

saketkc commented 1 year ago

Do you have just one replicate that you want to learn the cutoff on?

bsms9e0n commented 1 year ago

Yes, just one replicate. I’m running this using a Mycobacterium tuberculosis genome, so I’m trying to find a suitable phase score to use in case it is different from the recommended options. Thank you!

saketkc commented 1 year ago

I am unable to reproduce this using the dataset we have (https://www.dropbox.com/s/xr0xsdluuni2b95/ribotricer_test_data_tair10.zip). If you can send your 1) ribo bam 2) rna bam and 3) ribotricer index, I can debug. My email is schoudhary@nygenome.org.

Here is what I ran:


$ ribotricer learn-cutoff --ribo_bams bams_unique/SRX219170.bam --rna_bams bams_unique/SRX347226.bam --prefix ribo_rna_prefix --ribotricer_index index/ribotricer_v44_annotation_longest_candidate_orfs.tsv

... [truncated]
Jan 10 11:10:13 ... started plotting metagene profiles
Jan 10 11:10:16 ... started inferring P-site offsets
Jan 10 11:10:16 ... started shifting according to P-site offsets
Jan 10 11:10:25 ... started exporting wig file of alignments after shifting
Jan 10 11:10:30 ... started calculating phase scores for each ORF
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 397338/397339 [09:07<00:00, 726.38ORFs/s]
Jan 10 11:19:38 ... finished ribotricer detect-orfs
sampling_ratio: 0.33
n_samples: 20000
ribo_phase_score_mean: 0.508
ribo_phase_score_median: 0.508
ribo_phase_score_sd: 0.001
rna_phase_score_mean: 0.087
rna_phase_score_median: 0.087
rna_phase_score_sd: 0.001
diff_phase_score_sampled_mean: 0.421
diff_phase_score_sampled_median: 0.421
diff_phase_score_sampled_sd: 0.001
diff_phase_score_all_mean: 0.291
diff_phase_score_all_median: 0.398
diff_phase_score_all_sd: 0.345
recommended_cutoff: 0.421
saketkc commented 1 year ago

thanks for sharing the files over email. Your index (and your gtf) has assumed_protein_coding as the gene_type. learn-cutoff looks for protein_coding genes. You should be able to run the step by first creating a modified index:

 $ cat RiboTricer_step1_candidate_orfs.tsv|  sed 's/assumed_protein_coding/protein_coding/g'  > modified_index.tsv

and then passing the modified_index.tsv to learn-cutoff:

$ ribotricer learn-cutoff --ribo_bams Ribo_bam_sorted.bam --rna_bams RNA_bam_sorted.bam --prefix aaron --ribotricer_index modified_index.tsv

sampling_ratio: 0.33
n_samples: 20000
ribo_phase_score_mean: 0.159
ribo_phase_score_median: 0.159
ribo_phase_score_sd: 0.002
rna_phase_score_mean: 0.154
rna_phase_score_median: 0.154
rna_phase_score_sd: 0.003
diff_phase_score_sampled_mean: 0.005
diff_phase_score_sampled_median: 0.005
diff_phase_score_sampled_sd: 0.004
diff_phase_score_all_mean: 0.010
diff_phase_score_all_median: 0.002
diff_phase_score_all_sd: 0.141
recommended_cutoff: 0.005

It seems the RNA and Ribo periodicity are very close. Have you looked at the metagene plots outputted by ribotricer to see if the values make sense?

saketkc commented 1 year ago

The other possibility would be that the Ribo bam is actually an RNA bam?

To give you some more context, we have observed the cutoff to vary based on species. But we do expect these scores to be around 0.31-0.41. Here's an example in Candida albicans (fungi) , where the default cutoff of 0.41 seems to work nicely: https://pubmed.ncbi.nlm.nih.gov/33585865/ . You can also look at Figure 1 for an example of the contrast in periodicity between ribo and rna samples.

bsms9e0n commented 1 year ago

Thanks for getting back to my so quickly, this looks super interesting! I've checked back at my QC data and IGV and they are definitely the correct way round. I haven't yet been successful with the metagene analysis. Because this is bacteria, we've used micrococcal nuclease for digestion since RNase I does not work, and needs 3' alignment for p-site determination.

Could this possibly affect this learn cutoff script, if its aligning to 5' end instead? Is there a way to change this?

Thanks again so much!

saketkc commented 1 year ago

Sorry for the delay.

Yes, by default ribotricer performs shifting using the 3' end. What would be the ideal offsets for your data, i.e. for say raed lengths 28-32, by how much should the 5' site be shifted to estimate the P-sites?

The functionality is currently supported in ribotricer, but not exposed to the user.

bsms9e0n commented 1 year ago

Thanks for getting back to me - The ideal offset for our data is 33 nt for this ribo sample - this is the distance we see the highest ribosome peak from the start codon when aligned to the 5' (which is in line with the literature, where huge variation has been reported in bacteria).

Thats really good to know, thanks! Is there an easy way I can change the function? Just within one of the Py scripts?

Cheers

saketkc commented 1 year ago

It is easy, but it needs to be propogated throughout the pipeline. The starting point would be https://github.com/smithlabcode/ribotricer/blob/4e2f001faddfb946ce3ab0abf108cb31543a23c1/ribotricer/detect_orfs.py#L133 - basically every call to this method will require passing in a user provided 5' offset.

lananh-ngn commented 1 year ago

Hello, I apologize in advance if it is not the right place to post my problem but I encountered the same problem:

Mar 22 18:56:21 ..... started ribotricer detect-orfs
Mar 22 18:56:21 ... started parsing ribotricer index file
Mar 22 18:56:22 ... started inferring experimental design                       
Mar 22 19:15:20 ... started reading bam file
Mar 22 20:15:45 ... started plotting read length distribution                   
Mar 22 20:15:46 ... started calculating metagene profiles. This may take a long time...

Mar 22 20:19:16 ... started plotting metagene profiles                          
Mar 22 20:19:23 ... started inferring P-site offsets
Mar 22 20:19:23 ... started shifting according to P-site offsets
Mar 22 20:19:35 ... started exporting wig file of alignments after shifting
Mar 22 20:19:40 ... started calculating phase scores for each ORF
100%|█████████████████████████████████▉| 53535/53536 [02:02<00:00, 435.92ORFs/s]
Mar 22 20:21:43 ... finished ribotricer detect-orfs
Running ribotricer on 1 RNA-seq sample ..... 

Mar 22 20:21:47 ..... started ribotricer detect-orfs
Mar 22 20:21:47 ... started parsing ribotricer index file
Mar 22 20:21:47 ... started inferring experimental design                       
Mar 22 21:42:30 ... started reading bam file
Mar 23 00:16:43 ... started plotting read length distribution                   
Mar 23 00:16:44 ... started calculating metagene profiles. This may take a long time...

Mar 23 00:19:33 ... started plotting metagene profiles                          
Mar 23 00:19:49 ... started inferring P-site offsets
Mar 23 00:19:51 ... started shifting according to P-site offsets
Mar 23 00:20:08 ... started exporting wig file of alignments after shifting
Mar 23 00:21:36 ... started calculating phase scores for each ORF
100%|█████████████████████████████████▉| 53535/53536 [04:00<00:00, 222.60ORFs/s]
Mar 23 00:25:38 ... finished ribotricer detect-orfs
Traceback (most recent call last):
  File "/home/lana/anaconda3/envs/ribotricer_env/bin/ribotricer", line 10, in <module>
    sys.exit(cli())
  File "/home/lana/anaconda3/envs/ribotricer_env/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/lana/anaconda3/envs/ribotricer_env/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/lana/anaconda3/envs/ribotricer_env/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/lana/anaconda3/envs/ribotricer_env/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/lana/anaconda3/envs/ribotricer_env/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/lana/anaconda3/envs/ribotricer_env/lib/python3.7/site-packages/ribotricer/cli.py", line 506, in determine_cutoff_cmd
    report_all=True,
  File "/home/lana/anaconda3/envs/ribotricer_env/lib/python3.7/site-packages/ribotricer/learn_cutoff.py", line 244, in determine_cutoff_bam
    determine_cutoff_tsv(ribo_tsvs, rna_tsvs, filter_by, sampling_ratio, reps)
  File "/home/lana/anaconda3/envs/ribotricer_env/lib/python3.7/site-packages/ribotricer/learn_cutoff.py", line 92, in determine_cutoff_tsv
    ribo_phase_median_samples = np.median(ribo_phase_scores_all[ribo_indices], axis=0)
IndexError: arrays used as indices must be of integer (or boolean) type

Could you help? I suspect the gtf might not be the right format because it is not a gencode gtf. Thanks in advance.

saketkc commented 1 year ago

@lananh-ngn If you can send your 1) ribo bam 2) rna bam and 3) ribotricer index, I can debug. My email is schoudhary@nygenome.org.

michellelpeters commented 1 year ago

I was just about to post - I am also encountering the same error! Could you please share if you figure out how to debug?

lananh-ngn commented 1 year ago

@lananh-ngn If you can send your 1) ribo bam 2) rna bam and 3) ribotricer index, I can debug. My email is schoudhary@nygenome.org.

@saketkc Thank you for coming back to me, just before downsampling and sending you the documents, I realized that despite my gtf file having gene_type to protein_coding, in the index file, it was written assuming_protein_coding. After changing to protein_coding as you mentioned earlier it worked. Thank you for coming back that quickly.

@michellelpeters I'm tagging you to let you know how I solved it, hoping it would help.

michellelpeters commented 1 year ago

@lananh-ngn Thanks -- just checked and the same happened with my index file!