lemieux-lab / BamQuery

BamQuery provides RNA-seq quantification for a given peptide (from 8 to 11 residues) in chosen RNA-seq samples (single or bulk).
https://bamquery.iric.ca/
MIT License
4 stars 2 forks source link

no output for biotype result? #3

Open sundysun99 opened 1 year ago

sundysun99 commented 1 year ago

Hi, I am trying to use this software to analyze my MS data. But I'm not sure if the calculation is finished, especially since none of the biotype related results are output.

Here are my input, BAM_directories.tsv:

0_treat_1   /xx1/2stpass_sample-1_allAligned.sortedByCoord.out.bam
0_treat_2   /xx2/2stpass_sample-2_allAligned.sortedByCoord.out.bam
0_treat_3   /xx3/2stpass_sample-3_allAligned.sortedByCoord.out.bam
48_treat_1  /xx4/2stpass_sample-4_allAligned.sortedByCoord.out.bam
48_treat_2  /xx5/2stpass_sample-5_allAligned.sortedByCoord.out.bam
48_treat_3  /xx6/2stpass_sample-6_allAligned.sortedByCoord.out.bam

peptides.tsv:(600+ peptide)

LVSPPTKFV   Mouse_1st_9_peptide
WLGMKDEYT   Mouse_1st_9_peptide
TKPKLGLPE   Mouse_1st_9_peptide
RTKTKKCVP   Mouse_1st_9_peptide
VVPTPLKPH   Mouse_1st_9_peptide
WVAKGRGPV   Mouse_1st_9_peptide
LLQETAPVV   Mouse_1st_9_peptide
...

And log file is

2023-10-30 10:52:59,089 INFO Path to input folder : /xx/Input/ 
2023-10-30 10:52:59,089 INFO Path to output folder : /xx/output/ 
2023-10-30 10:52:59,089 INFO =============== # ===================
2023-10-30 10:52:59,089 INFO =============== Start Parameters ===============
2023-10-30 10:52:59,089 INFO  - BamQuery id : SB_9peptide 
2023-10-30 10:52:59,089 INFO  - Mode : normal, Strandedness :  False, Light:  False 
2023-10-30 10:52:59,089 INFO  - Single-Cell experiment (sc) :  False, Count UMIs : False
2023-10-30 10:52:59,089 INFO  - dbSNP :  mouse_GRCm38, COMMON SNPs : False, Genome Version : M24 
2023-10-30 10:52:59,089 INFO  - Plots : True
2023-10-30 10:52:59,089 INFO  - Keep Variant Alignments : False, Keep High Amount Alignments : False
2023-10-30 10:52:59,089 INFO  - Counting overlapping reads : False
2023-10-30 10:52:59,089 INFO  - Mouse Genome : True
2023-10-30 10:52:59,090 INFO  - Threads : 12
2023-10-30 10:52:59,090 INFO =============== End Parameters ===============
2023-10-30 10:54:42,002 INFO 
2023-10-30 10:54:42,002 INFO 
2023-10-30 10:54:42,002 INFO BamQuery analysis will continue where it left....
2023-10-30 10:54:42,002 INFO Path to input folder : /xx/Input/ 
2023-10-30 10:54:42,002 INFO Path to output folder : /xx/output/ 
2023-10-30 10:54:42,002 INFO =============== # ===================
2023-10-30 10:54:42,002 INFO =============== Start Parameters ===============
2023-10-30 10:54:42,002 INFO  - BamQuery id : SB_9peptide 
2023-10-30 10:54:42,002 INFO  - Mode : normal, Strandedness :  False, Light:  False 
2023-10-30 10:54:42,002 INFO  - Single-Cell experiment (sc) :  False, Count UMIs : False
2023-10-30 10:54:42,002 INFO  - dbSNP :  mouse_GRCm38, COMMON SNPs : False, Genome Version : M24 
2023-10-30 10:54:42,002 INFO  - Plots : True
2023-10-30 10:54:42,002 INFO  - Keep Variant Alignments : False, Keep High Amount Alignments : False
2023-10-30 10:54:42,003 INFO  - Counting overlapping reads : False
2023-10-30 10:54:42,003 INFO  - Mouse Genome : True
2023-10-30 10:54:42,003 INFO  - Threads : 12
2023-10-30 10:54:42,003 INFO =============== End Parameters ===============
2023-10-30 10:54:42,048 INFO Total Bam Files to Query : 6.
2023-10-30 10:54:42,053 INFO Peptides to evaluate in Peptide Mode : 613
2023-10-30 10:54:42,053 INFO Peptides to evaluate in Coding Sequence (CS) Mode : 0
2023-10-30 10:54:42,053 INFO Peptides to evaluate in Manual Mode: 0
2023-10-30 10:54:42,053 INFO Total Peptides to evaluate : 613
2023-10-30 10:54:42,053 INFO ========== Treatment File : Done! ============ 
2023-10-30 10:55:21,410 INFO Total Coding Sequences : 63476032
Total time run function reverse_translation to end : 0.6558857679367065min
2023-10-30 10:55:21,413 INFO ========== Reverse Translation : Done! ============ 
2023-10-30 10:55:21,630 INFO Command to Align using STAR : ulimit -s 8192; STAR --runThreadN 12 --genomeDir /xx/bamquery/lib/genome_versions/genome_mouse_m24/Index_STAR_2.7.9a/ --seedSearchStartLmax 20 --alignEndsType EndToEnd --sjdbOverhang 32 --alignSJDBoverhangMin 1 --alignSJoverhangMin 10000 --outFilterMismatchNmax 4 --outFilterIntronMotifs RemoveNoncanonicalUnannotated --scoreGapNoncan -50 --outFilterType BySJout --winAnchorMultimapNmax 1000 --outFilterMultimapNmax 1000000 --readFilesIn /xx/output/genome_alignments/SB_9peptide.fastq --outSAMattributes NH HI MD --limitOutSJcollapsed 5000000 --limitOutSAMoneReadBytes 266000000 --outFilterMultimapScoreRange 2 --alignTranscriptsPerWindowNmax 1000 --alignWindowsPerReadNmax 15000 --seedNoneLociPerWindow 1000 --seedPerWindowNmax 1000 --alignTranscriptsPerReadNmax 20000 --outFileNamePrefix /xx/output/genome_alignments/ 
2023-10-30 11:31:22,279 INFO Total time run function alignment_CS_to_genome to end : 36.010810 min
2023-10-30 11:31:22,491 INFO Using genome version /xx/bamquery/lib/genome_versions/genome_mouse_m24/GRCm38.primary_assembly.genome.fa. 
2023-10-30 11:31:22,494 INFO Using dbSNP database mouse_GRCm38 with COMMON SNPs = False. Database Path : /xx/bamquery/lib//snps/snps_dics_mouse_GRCm38/ 
2023-10-30 11:31:58,345 INFO Time reading SAM file : 0.597516 min Chromosomes 65 
2023-10-30 11:38:50,853 INFO Total time run function get_alignments to end : 7.476200 min
2023-10-30 11:38:50,854 INFO Total alignments : 423 
2023-10-30 11:38:53,474 INFO Alignments Information save to : /xx/output/alignments/Alignments_information.dic 
2023-10-30 11:38:53,481 INFO ========== Alignment : Done! ============ 
2023-10-30 11:38:53,483 INFO Total missed_peptides : 413. Find the list in : /xx/output/alignments/missed_peptides.info.
2023-10-30 11:38:53,483 INFO ========== Common_to_modes : Done! ============ 
2023-10-30 11:38:53,486 INFO Total MCS mapped : 423 
2023-10-30 11:38:53,487 INFO Total unique regions : 353 
2023-10-30 11:38:53,666 INFO It was not possible to associate these chromosomes in the genome annotations reference: GL456210.1,GL456211.1,GL456212.1,GL456213.1,GL456216.1,GL456219.1,GL456221.1,GL456233.1,GL456239.1,GL456350.1,GL456354.1,GL456359.1,GL456360.1,GL456366.1,GL456367.1,GL456368.1,GL456370.1,GL456372.1,GL456378.1,GL456379.1,GL456381.1,GL456382.1,GL456383.1,GL456385.1,GL456387.1,GL456389.1,GL456390.1,GL456392.1,GL456393.1,GL456394.1,GL456396.1,JH584292.1,JH584293.1,JH584294.1,JH584295.1,JH584296.1,JH584297.1,JH584298.1,JH584299.1,JH584300.1,JH584301.1,JH584302.1,JH584303.1,JH584304.1 to any chr in the bam file investigated: /xx/2stpass_sample-1_allAligned.sortedByCoord.out.bam. 
Regions in these chromosomes will be associated with a count of zero.
2023-10-30 11:38:56,509 INFO Processed Bam File : 0 0_treat_bam_0_treat_1. Time : 0.048571 min
2023-10-30 11:38:56,573 INFO It was not possible to associate these chromosomes in the genome annotations reference: GL456210.1,GL456211.1,GL456212.1,GL456213.1,GL456216.1,GL456219.1,GL456221.1,GL456233.1,GL456239.1,GL456350.1,GL456354.1,GL456359.1,GL456360.1,GL456366.1,GL456367.1,GL456368.1,GL456370.1,GL456372.1,GL456378.1,GL456379.1,GL456381.1,GL456382.1,GL456383.1,GL456385.1,GL456387.1,GL456389.1,GL456390.1,GL456392.1,GL456393.1,GL456394.1,GL456396.1,JH584292.1,JH584293.1,JH584294.1,JH584295.1,JH584296.1,JH584297.1,JH584298.1,JH584299.1,JH584300.1,JH584301.1,JH584302.1,JH584303.1,JH584304.1 to any chr in the bam file investigated: /xx/2stpass_sample-4_allAligned.sortedByCoord.out.bam. 
Regions in these chromosomes will be associated with a count of zero.
2023-10-30 11:38:59,397 INFO Processed Bam File : 1 48_treat_bam_48_treat_1. Time : 0.048127 min
2023-10-30 11:38:59,471 INFO It was not possible to associate these chromosomes in the genome annotations reference: GL456210.1,GL456211.1,GL456212.1,GL456213.1,GL456216.1,GL456219.1,GL456221.1,GL456233.1,GL456239.1,GL456350.1,GL456354.1,GL456359.1,GL456360.1,GL456366.1,GL456367.1,GL456368.1,GL456370.1,GL456372.1,GL456378.1,GL456379.1,GL456381.1,GL456382.1,GL456383.1,GL456385.1,GL456387.1,GL456389.1,GL456390.1,GL456392.1,GL456393.1,GL456394.1,GL456396.1,JH584292.1,JH584293.1,JH584294.1,JH584295.1,JH584296.1,JH584297.1,JH584298.1,JH584299.1,JH584300.1,JH584301.1,JH584302.1,JH584303.1,JH584304.1 to any chr in the bam file investigated: /xx/2stpass_sample-2_allAligned.sortedByCoord.out.bam. 
Regions in these chromosomes will be associated with a count of zero.
2023-10-30 11:39:02,172 INFO Processed Bam File : 2 0_treat_bam_0_treat_2. Time : 0.046251 min
2023-10-30 11:39:02,247 INFO It was not possible to associate these chromosomes in the genome annotations reference: GL456210.1,GL456211.1,GL456212.1,GL456213.1,GL456216.1,GL456219.1,GL456221.1,GL456233.1,GL456239.1,GL456350.1,GL456354.1,GL456359.1,GL456360.1,GL456366.1,GL456367.1,GL456368.1,GL456370.1,GL456372.1,GL456378.1,GL456379.1,GL456381.1,GL456382.1,GL456383.1,GL456385.1,GL456387.1,GL456389.1,GL456390.1,GL456392.1,GL456393.1,GL456394.1,GL456396.1,JH584292.1,JH584293.1,JH584294.1,JH584295.1,JH584296.1,JH584297.1,JH584298.1,JH584299.1,JH584300.1,JH584301.1,JH584302.1,JH584303.1,JH584304.1 to any chr in the bam file investigated: /xx/2stpass_sample-5_allAligned.sortedByCoord.out.bam. 
Regions in these chromosomes will be associated with a count of zero.
2023-10-30 11:39:05,199 INFO Processed Bam File : 3 48_treat_bam_48_treat_2. Time : 0.050428 min
2023-10-30 11:39:05,264 INFO It was not possible to associate these chromosomes in the genome annotations reference: GL456210.1,GL456211.1,GL456212.1,GL456213.1,GL456216.1,GL456219.1,GL456221.1,GL456233.1,GL456239.1,GL456350.1,GL456354.1,GL456359.1,GL456360.1,GL456366.1,GL456367.1,GL456368.1,GL456370.1,GL456372.1,GL456378.1,GL456379.1,GL456381.1,GL456382.1,GL456383.1,GL456385.1,GL456387.1,GL456389.1,GL456390.1,GL456392.1,GL456393.1,GL456394.1,GL456396.1,JH584292.1,JH584293.1,JH584294.1,JH584295.1,JH584296.1,JH584297.1,JH584298.1,JH584299.1,JH584300.1,JH584301.1,JH584302.1,JH584303.1,JH584304.1 to any chr in the bam file investigated: /xx/2stpass_sample-3_allAligned.sortedByCoord.out.bam. 
Regions in these chromosomes will be associated with a count of zero.
2023-10-30 11:39:07,788 INFO Processed Bam File : 4 0_treat_bam_0_treat_3. Time : 0.043154 min
2023-10-30 11:39:07,842 INFO It was not possible to associate these chromosomes in the genome annotations reference: GL456210.1,GL456211.1,GL456212.1,GL456213.1,GL456216.1,GL456219.1,GL456221.1,GL456233.1,GL456239.1,GL456350.1,GL456354.1,GL456359.1,GL456360.1,GL456366.1,GL456367.1,GL456368.1,GL456370.1,GL456372.1,GL456378.1,GL456379.1,GL456381.1,GL456382.1,GL456383.1,GL456385.1,GL456387.1,GL456389.1,GL456390.1,GL456392.1,GL456393.1,GL456394.1,GL456396.1,JH584292.1,JH584293.1,JH584294.1,JH584295.1,JH584296.1,JH584297.1,JH584298.1,JH584299.1,JH584300.1,JH584301.1,JH584302.1,JH584303.1,JH584304.1 to any chr in the bam file investigated: /xx/2stpass_sample-6_allAligned.sortedByCoord.out.bam. 
Regions in these chromosomes will be associated with a count of zero.
2023-10-30 11:39:10,805 INFO Processed Bam File : 5 48_treat_bam_48_treat_3. Time : 0.050287 min
2023-10-30 11:39:10,848 INFO Average time to process a BamFile : 0.047803 min
2023-10-30 11:39:10,887 INFO Counts Information saved to : /xx/output/res/temps_files/SB_9peptide_rna_count.csv 
2023-10-30 11:39:10,887 INFO Total time run function get_counts to end : 0.290035 min
VirginieR commented 1 year ago

Hi Sundy,

I have a suggestion and also some questions for you.

First of all, I noticed that you are querying 613 peptides and only for 200 of them you have found alignments in the genome. So I suggest you include in the search the parameters --var and --maxmm, this will hopefully help to find locations for the missing peptides.

I searched for your peptides and when I am not using the two parameters 6/7 peptides are missed but when I add those two parameters only 1 peptide is missed. So the parameters should look like this:

2023-11-02 21:23:29,897 INFO =============== Start Parameters ===============
2023-11-02 21:23:29,897 INFO  - BamQuery id : mouse_example 
2023-11-02 21:23:29,897 INFO  - Mode : normal, Strandedness :  False, Light:  False 
2023-11-02 21:23:29,897 INFO  - Single-Cell experiment (sc) :  False, Count UMIs : False
2023-11-02 21:23:29,897 INFO  - dbSNP :  mouse_GRCm38, COMMON SNPs : False, Genome Version : M24 
2023-11-02 21:23:29,897 INFO  - Plots : True
2023-11-02 21:23:29,897 INFO  - Keep Variant Alignments : True, Keep High Amount Alignments : True
2023-11-02 21:23:29,897 INFO  - Counting overlapping reads : False
2023-11-02 21:23:29,897 INFO  - Mouse Genome : True
2023-11-02 21:23:29,897 INFO  - Threads : 4
2023-11-02 21:23:29,897 INFO =============== End Parameters ===============

Also, I wanted to make sure that the bam files you are querying the peptides in were aligned in the murine GRCm38 genome.

Lastly, did you get any errors? It looks like BamQuery didn't finish correctly, that's why you don't have the biotype classification, but the console should indicate if there were any errors that prevented BamQuery from finishing.

sundysun99 commented 1 year ago

Hi, thank you for reply.

(1) yes, murine GRCm38 genome is the ref I use (2)I tried two sets of bam files, the first one I added multiple custom parameters when aligned with STAR, and the second one I didn't add any parameters, and neither of them output the biotype inference. It is assumed that the bam file format is the cause of the error. Seems there is no mention of bam file requirements in the tutorial. Is there some information I'm missing? (3) --var and --maxmm were used, but still 200/613 could successfully aligned. but when I try different bam (STAR run with no additional parameters), the aligned peptide number changed to 500+ .

Here is the error I got (bam with multiple parameters by STAR) Traceback (most recent call last): File "/xx/bamquery/BamQuery/BamQuery.py", line 646, in <module> main(sys.argv[1:]) File "/xx/bamquery/BamQuery/BamQuery.py", line 604, in main BamQuery(path_to_input_folder, path_to_output_folder, name_exp, mode, strandedness, th_out, light, dev, plots, dbSNP, c, super_logger, bam_files_logger, sc, umi, var, maxmm, genome_version, overlap, mouse, t) File "/xx/bamquery/BamQuery/BamQuery.py", line 53, in __init__ self.run_bam_query_normal_mode(bam_files_logger) File "/xx/bamquery/BamQuery/BamQuery.py", line 103, in run_bam_query_normal_mode plots.get_heat_map(df_counts_rna, self.path_to_output_folder+'plots/heat_maps/transcription_evidence_heatmap/', self.mode, path_temps_file, self.name_exp, '_rna_counts', False, self.th_out) File "/xx/bamquery/BamQuery/plotting/plots.py", line 85, in get_heat_map tissue = bam_files_info_query[sample][6] KeyError: 'Position'

another try with different bam (STAR run with no additional parameters), the error is Traceback (most recent call last): File "/xx/bamquery/BamQuery/genomics/normalization.py", line 61, in get_normalization info_bam_file = dictionary_total_reads_bam_files[name_bam_file] KeyError: 'Position' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/xx/bamquery/BamQuery/BamQuery.py", line 646, in main(sys.argv[1:]) File "/xx/bamquery/BamQuery/BamQuery.py", line 604, in main BamQuery(path_to_input_folder, path_to_output_folder, name_exp, mode, strandedness, th_out, light, dev, plots, dbSNP, c, super_logger, bam_files_logger, sc, umi, var, maxmm, genome_version, overlap, mouse, t) File "/xx/bamquery/BamQuery/BamQuery.py", line 53, in init self.run_bam_query_normal_mode(bam_files_logger) File "/xx/bamquery/BamQuery/BamQuery.py", line 108, in run_bam_query_normal_mode def_norm_rna = normalization.get_normalization(df_counts_rna, '_rna_norm.csv') File "/xx/bamquery/BamQuery/genomics/normalization.py", line 84, in get_normalization raise Exception("\nBefore to continue you need to verify that the primary read count for the bam file "+name_bam_file+" is already included in the dictionary. To do so: verify in the log Get_Read_Count_BAM_directories.log that the primary read count processes have finished. Please re-launch BamQuery, once all the primary read counts have been included." ) Exception: Before to continue you need to verify that the primary read count for the bam file Position is already included in the dictionary. To do so: verify in the log Get_Read_Count_BAM_directories.log that the primary read count processes have finished. Please re-launch BamQuery, once all the primary read counts have been included.

Then I check the code, my bam_files_info_query.dic file is >>> bam_files_info_query {'bam_file_sample-1_all': ['/xx/bam_file/sample-1_all/2stpass_sample-1_allAligned.sortedByCoord.out.bam', 'unstranded', 'unstranded', '0_treat_1', 83803720, 'R1', '0h'], 'bam_file_sample-2_all': ['/xx/bam_file/sample-2_all/2stpass_sample-2_allAligned.sortedByCoord.out.bam', 'unstranded', 'unstranded', '0_treat_2', 73483150, 'R2', '0h'], 'bam_file_sample-3_all': ['/xx/bam_file/sample-3_all/2stpass_sample-3_allAligned.sortedByCoord.out.bam', 'unstranded', 'unstranded', '0_treat_3', 73895430, 'R3', '0h'], 'bam_file_sample-4_all': ['/xx/bam_file/sample-4_all/2stpass_sample-4_allAligned.sortedByCoord.out.bam', 'unstranded', 'unstranded', '48_treat_1', 70786450, 'R1', '48h'], 'bam_file_sample-5_all': ['/xx/bam_file/sample-5_all/2stpass_sample-5_allAligned.sortedByCoord.out.bam', 'unstranded', 'unstranded', '48_treat_2', 75071165, 'R2', '48h'], 'bam_file_sample-6_all': ['/xx/bam_file/sample-6_all/2stpass_sample-6_allAligned.sortedByCoord.out.bam', 'unstranded', 'unstranded', '48_treat_3', 77277771, 'R3', '48h']}

VirginieR commented 1 year ago

Hi Sundy,

I'm a little confused with your answers. So I think it would be easier to address the problem if we schedule a call, so we can make sure we talk about the same issues. Could you write to me at: maria.virginia.ruiz.cuevas@umontreal.ca ?

Thank you!