iMetOsaka / UNAGI

3 stars 4 forks source link

Error in final output #3

Closed aman21392 closed 3 years ago

aman21392 commented 3 years ago

Hi, I send you the command which i run and the output of transitionnal directory--

python3 /home/aclab/apps/UNAGI/app/unagi.py -i /Drive4/E6/combined_T.fastq -g /Drive1/human_index/Homo_sapiens.GRCh38.dna.toplevel.fa -o /Drive4/E6/UNAGI/ -V

Only these files are made and there is not any final output.

negative_3dash_positions.txt negative_genelist_incomplete.bed positive_5dash_positions.txt positive_mapped.bam raw_sorted.bam negative_5dash_positions.txt negative_mapped.bam positive.bed positive_mapped.sam raw_sorted.bed negative.bed negative_mapped.sam positive_coverage.txt positive_sorted.bam raw_sorted_split.bed negative_coverage.txt negative_sorted.bam positive.fasta raw_mapped.bam stranded.fastq negative.fasta positive_3dash_positions.txt positive_genelist_incomplete.bed raw_mapped.sam total_coverage.txt

[M::mm_idx_gen::61.1941.88] collected minimizers [M::mm_idx_gen::86.7472.19] sorted minimizers [M::main::86.7472.19] loaded/built the index for 194 target sequence(s) [M::mm_mapopt_update::90.1932.14] mid_occ = 765 [M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 194 [M::mm_idx_stat::92.6582.11] distinct minimizers: 167225302 (35.46% are singletons); average occurrences: 6.030; average spacing: 3.074; total length: 3099750718 [M::worker_pipeline::99.5912.16] mapped 49817 sequences [M::worker_pipeline::100.3422.15] mapped 49817 sequences [M::main] Version: 2.17-r974-dirty [M::main] CMD: minimap2 -ax splice -G 5k --secondary=no --split-prefix STR /Drive1/new_human/Homo_sapiens.GRCh38.dna.primary_assembly.fa /Drive4/E6/UNAGI/transitionnal/stranded.fastq [M::main] Real time: 100.343 sec; CPU: 216.128 sec; Peak RSS: 18.500 GB index file /Drive1/new_human/Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai not found, generating... [M::mm_idx_gen::64.7711.91] collected minimizers [M::mm_idx_gen::93.1832.22] sorted minimizers [M::main::93.1832.22] loaded/built the index for 194 target sequence(s) [M::mm_mapopt_update::97.9792.16] mid_occ = 765 [M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 194 [M::mm_idx_stat::99.8862.14] distinct minimizers: 167225302 (35.46% are singletons); average occurrences: 6.030; average spacing: 3.074; total length: 3099750718 [M::worker_pipeline::101.7332.15] mapped 1292 sequences [M::worker_pipeline::102.0442.15] mapped 1292 sequences [M::main] Version: 2.17-r974-dirty [M::main] CMD: minimap2 -ax splice -G 5k --secondary=no --split-prefix STR /Drive1/new_human/Homo_sapiens.GRCh38.dna.primary_assembly.fa /Drive4/E6/UNAGI/transitionnal/positive.fasta [M::main] Real time: 102.044 sec; CPU: 219.038 sec; Peak RSS: 18.500 GB [M::mm_idx_gen::62.6671.85] collected minimizers [M::mm_idx_gen::91.6082.17] sorted minimizers [M::main::91.6082.17] loaded/built the index for 194 target sequence(s) [M::mm_mapopt_update::96.6472.11] mid_occ = 765 [M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 194 [M::mm_idx_stat::98.6932.08] distinct minimizers: 167225302 (35.46% are singletons); average occurrences: 6.030; average spacing: 3.074; total length: 3099750718 [M::worker_pipeline::100.3752.10] mapped 1058 sequences [M::worker_pipeline::100.546*2.10] mapped 1058 sequences [M::main] Version: 2.17-r974-dirty [M::main] CMD: minimap2 -ax splice -G 5k --secondary=no --split-prefix STR /Drive1/new_human/Homo_sapiens.GRCh38.dna.primary_assembly.fa /Drive4/E6/UNAGI/transitionnal/negative.fasta [M::main] Real time: 100.546 sec; CPU: 210.758 sec; Peak RSS: 18.500 GB


*****WARNING: Genome (-g) files are ignored when BAM input is provided.



*****WARNING: Genome (-g) files are ignored when BAM input is provided.


[2020/10/25 - 13:41:35] Reading the input file [2020/10/25 - 13:43:59] Finding the different strands in the input reads [2020/10/25 - 13:43:59] Finding the forwards reads. [2020/10/25 - 13:44:02] 17973 forwards reads found [2020/10/25 - 13:44:02] Finding the backwards reads. [2020/10/25 - 13:44:04] 31844 backwards reads found [2020/10/25 - 13:44:04] Getting the reverse complement for the backwards reads. [2020/10/25 - 13:44:08] A total of 49817 records out of 1201732 (4%) were successfully stranded [2020/10/25 - 13:44:12] Mapping the reads to the genome [2020/10/25 - 13:44:12] Running minimap with the following command: minimap2 -ax splice -G 5k --secondary=no --split-prefix STR /Drive1/new_human/Homo_sapiens.GRCh38.dna.primary_assembly.fa /Drive4/E6/UNAGI/transitionnal/stranded.fastq > /Drive4/E6/UNAGI/transitionnal/raw_mapped.sam [2020/10/25 - 13:45:53] Finished running minimap [2020/10/25 - 13:45:53] Sorting the mapped reads [2020/10/25 - 13:45:53] Running samtools with the following command: samtools view -Sb /Drive4/E6/UNAGI/transitionnal/raw_mapped.sam > /Drive4/E6/UNAGI/transitionnal/raw_mapped.bam [2020/10/25 - 13:45:57] Finished running samtools [2020/10/25 - 13:45:57] Running samtools with the following command: samtools sort /Drive4/E6/UNAGI/transitionnal/raw_mapped.bam -o /Drive4/E6/UNAGI/transitionnal/raw_sorted.bam > /Drive4/E6/UNAGI/transitionnal/raw_sorted.bam [2020/10/25 - 13:46:01] Finished running samtools [2020/10/25 - 13:46:01] Generating the positive and negative bed files [2020/10/25 - 13:46:01] Running bedtools with the following command: bedtools bamtobed -i /Drive4/E6/UNAGI/transitionnal/raw_sorted.bam > /Drive4/E6/UNAGI/transitionnal/raw_sorted.bed [2020/10/25 - 13:46:02] Finished running bedtools [2020/10/25 - 13:46:02] Running bedtools with the following command: bedtools bamtobed -split -i /Drive4/E6/UNAGI/transitionnal/raw_sorted.bam > /Drive4/E6/UNAGI/transitionnal/raw_sorted_split.bed [2020/10/25 - 13:46:04] Finished running bedtools [2020/10/25 - 13:46:04] Found %i positive entries and %i negative ones. [2020/10/25 - 13:46:04] Generating fasta files from our bed files and the genome [2020/10/25 - 13:46:04] Running bedtools with the following command: bedtools getfasta -fi /Drive1/new_human/Homo_sapiens.GRCh38.dna.primary_assembly.fa -bed /Drive4/E6/UNAGI/transitionnal/positive.bed -fo /Drive4/E6/UNAGI/transitionnal/positive.fasta [2020/10/25 - 13:46:06] Finished running bedtools [2020/10/25 - 13:46:06] Running bedtools with the following command: bedtools getfasta -fi /Drive1/new_human/Homo_sapiens.GRCh38.dna.primary_assembly.fa -bed /Drive4/E6/UNAGI/transitionnal/negative.bed -fo /Drive4/E6/UNAGI/transitionnal/negative.fasta [2020/10/25 - 13:46:07] Finished running bedtools [2020/10/25 - 13:46:07] Mapping the new fasta files to the genome [2020/10/25 - 13:46:07] Running minimap with the following command: minimap2 -ax splice -G 5k --secondary=no --split-prefix STR /Drive1/new_human/Homo_sapiens.GRCh38.dna.primary_assembly.fa /Drive4/E6/UNAGI/transitionnal/positive.fasta > /Drive4/E6/UNAGI/transitionnal/positive_mapped.sam [2020/10/25 - 13:47:50] Finished running minimap [2020/10/25 - 13:47:50] Running minimap with the following command: minimap2 -ax splice -G 5k --secondary=no --split-prefix STR /Drive1/new_human/Homo_sapiens.GRCh38.dna.primary_assembly.fa /Drive4/E6/UNAGI/transitionnal/negative.fasta > /Drive4/E6/UNAGI/transitionnal/negative_mapped.sam [2020/10/25 - 13:49:32] Finished running minimap [2020/10/25 - 13:49:32] Sorting the new mapped reads [2020/10/25 - 13:49:32] Running samtools with the following command: samtools view -Sb /Drive4/E6/UNAGI/transitionnal/positive_mapped.sam > /Drive4/E6/UNAGI/transitionnal/positive_mapped.bam [2020/10/25 - 13:49:33] Finished running samtools [2020/10/25 - 13:49:33] Running samtools with the following command: samtools view -Sb /Drive4/E6/UNAGI/transitionnal/negative_mapped.sam > /Drive4/E6/UNAGI/transitionnal/negative_mapped.bam [2020/10/25 - 13:49:34] Finished running samtools [2020/10/25 - 13:49:34] Running samtools with the following command: samtools sort /Drive4/E6/UNAGI/transitionnal/positive_mapped.bam -o /Drive4/E6/UNAGI/transitionnal/positive_sorted.bam > /Drive4/E6/UNAGI/transitionnal/positive_sorted.bam [2020/10/25 - 13:49:34] Finished running samtools [2020/10/25 - 13:49:34] Running samtools with the following command: samtools sort /Drive4/E6/UNAGI/transitionnal/negative_mapped.bam -o /Drive4/E6/UNAGI/transitionnal/negative_sorted.bam > /Drive4/E6/UNAGI/transitionnal/negative_sorted.bam [2020/10/25 - 13:49:35] Finished running samtools [2020/10/25 - 13:49:35] Generating the genome coverage for each position [2020/10/25 - 13:49:35] Running bedtools with the following command: bedtools genomecov -d -ibam /Drive4/E6/UNAGI/transitionnal/positive_sorted.bam -g /Drive1/new_human/Homo_sapiens.GRCh38.dna.primary_assembly.fa > /Drive4/E6/UNAGI/transitionnal/positive_coverage.txt [2020/10/25 - 15:04:25] Finished running bedtools [2020/10/25 - 15:04:25] Running bedtools with the following command: bedtools genomecov -d -ibam /Drive4/E6/UNAGI/transitionnal/negative_sorted.bam -g /Drive1/new_human/Homo_sapiens.GRCh38.dna.primary_assembly.fa > /Drive4/E6/UNAGI/transitionnal/negative_coverage.txt [2020/10/25 - 16:15:55] Finished running bedtools [2020/10/25 - 17:20:11] Determining genes start and end positions from the coverage [2020/10/25 - 17:49:27] Coverage Breaks: Found 194 chromosomes and 23 genes total. [2020/10/25 - 18:19:47] Coverage Breaks: Found 194 chromosomes and 20 genes total. [2020/10/25 - 18:19:47] Determining genes end positions from 3' coverage [2020/10/25 - 18:19:47] Running bedtools with the following command: bedtools genomecov -dz -3 -ibam /Drive4/E6/UNAGI/transitionnal/positive_sorted.bam > /Drive4/E6/UNAGI/transitionnal/positive_3dash_positions.txt [2020/10/25 - 18:20:05] Finished running bedtools [2020/10/25 - 18:20:05] Running bedtools with the following command: bedtools genomecov -dz -3 -ibam /Drive4/E6/UNAGI/transitionnal/negative_sorted.bam > /Drive4/E6/UNAGI/transitionnal/negative_3dash_positions.txt [2020/10/25 - 18:20:22] Finished running bedtools [2020/10/25 - 18:20:22] Determining genes start positions from 5' coverage [2020/10/25 - 18:20:22] Running bedtools with the following command: bedtools genomecov -dz -5 -ibam /Drive4/E6/UNAGI/transitionnal/positive_sorted.bam > /Drive4/E6/UNAGI/transitionnal/positive_5dash_positions.txt [2020/10/25 - 18:20:39] Finished running bedtools [2020/10/25 - 18:20:39] Running bedtools with the following command: bedtools genomecov -dz -5 -ibam /Drive4/E6/UNAGI/transitionnal/negative_sorted.bam > /Drive4/E6/UNAGI/transitionnal/negative_5dash_positions.txt [2020/10/25 - 18:20:57] Finished running bedtools [2020/10/25 - 18:20:57] Intersecting genes start and end positions from the coverage analysis and cuts Traceback (most recent call last): File "/home/aclab/apps/UNAGI/app/unagi.py", line 772, in main(sys.argv[1:]) File "/home/aclab/apps/UNAGI/app/unagi.py", line 163, in main positiveChrList = insertDashCuts(os.path.join(transitionnalOutputPath,config["positive_3dash_file"]), os.path.join(transitionnalOutputPath,config["positive_5dash_file"]), positiveChrList, os.path.join(transitionnalOutputPath,config["positive_genelist_file"]), "+") File "/home/aclab/apps/UNAGI/app/unagi.py", line 299, in insertDashCuts currentChrList.append(clusterMean(currentCluster)) File "/home/aclab/apps/UNAGI/app/unagi.py", line 382, in clusterMean return int(round(totalPosition/totalSites)) ZeroDivisionError: division by zero

Can you please tell me what is wrong .Thanks is advance

JungNicolas commented 3 years ago

Hello, thank you for your submission and sorry for the delay in addressing your problem. It seems that the program was finding clusters of 3' or 5' sites in your data, however, the coverage at these sites (number of reads with a 3' or 5' at that position) was under the threshold set in the settings (min_3dash_threshold / min_5dash_threshold in the conf.ini file, default value 10). This resulted in an empty clusters and a division by 0, which is a bug on our part.

I corrected the bug, so this error should not show up anymore once you get the new version.

However, depending on your data set, you may want to play with the settings and lower some thresholds to get better results.

aman21392 commented 3 years ago

Hi i download the new version but i got this error. Can you please solve this error .

python3 /home/aclab/apps/UNAGI/app/unagi.py -v File "/home/aclab/apps/UNAGI/app/unagi.py", line 318 lastDashSite=dict(currentDashSite) ^ IndentationError: unindent does not match any outer indentation level

Thanks in advance

JungNicolas commented 3 years ago

Hi ! Well, this is awkward... Thanks for notifying me. The error should be fixed now, sorry about the inconvenience.

pkerrwall commented 3 years ago

FYI - I just ran UNAGI and still got the same error as prevous user aman21392. I modified conf.ini by setting min_3dash_threshold=5 & min_5dash_threshold=5 and it worked. So there is still the bug. It might also be good to have some example data (maybe a small portion of the nanopore reads from your paper & 1 of the chromsomes or select cdna). That way the user knows that it is running properly when they get the same results as you.

Command: /home/shared/software/UNAGI/unagi -i ../nanopore/DRR170482.sra.fastq -o DRR170482/ -g /home/shared/db/Scer/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa -V

Error: ... [2021/01/29 - 08:40:55] Intersecting genes start and end positions from the coverage analysis and cuts Traceback (most recent call last): File "/home/shareed/software/UNAGI/app/unagi.py", line 774, in main(sys.argv[1:]) File "/home/shared/software/UNAGI/app/unagi.py", line 163, in main positiveChrList = insertDashCuts(os.path.join(transitionnalOutputPath,config["positive_3dash_file"]), os.path.join(transitionnalOutputPath,config["positive_5dash_file"]), positiveChrList, os.path.join(transitionnalOutputPath,config["positive_genelist_file"]), "+") File "/home/shared/software/UNAGI/app/unagi.py", line 340, in insertDashCuts for cut3 in dashSiteList3[chr]: KeyError: 'I'

JungNicolas commented 3 years ago

Thank you for the notification of your error and for going the extra mile of giving me the specifics on how you could bypass it. It helped me in understanding what was causing it. While checking the 5' and 3' cut sites for each chromosome, there was a chance for no cuts being detected in a whole chromosome. The chromosome was then not listed in the cuts list, causing an error when it was reviewed in order to integrate the new cuts. When you reduced the threshold, you improved the chance for cuts to be detected and since every chromosome now had at least one cut, the error did not happen anymore. The same thing happened with our data. I made it so the chromosomes are not checked if no new cuts are found, which should fix your problem in a proper way, without needing to compromise on sensitivity. Please try it and let me know how it went if you have time.

I also want to include data for testing purposes as you suggested, I'll have to make a selection and get confirmation on what I can add to this project beforehand, though. Please bear with me for a little longer, I'll let you know when I add it.

JungNicolas commented 3 years ago

I finally was able to add a set of test data. I hope this helps !