virajbdeshpande / AmpliconArchitect

AmpliconArchitect (AA) is a tool to identify one or more connected genomic regions which have simultaneous copy number amplification and elucidates the architecture of the amplicon. In the current version, AA takes as input next generation sequencing reads (paired-end Illumina reads) mapped to the hg19/GRCh37 reference sequence and one or more regions of interest. Please "watch" this repository for improvements in runtime, accuracy and annotations for GRCh38 human reference genome coming up soon.
Other
131 stars 41 forks source link

A problem about "ValueError: cannot convert float NaN to integer" #92

Open tanzhengtang opened 3 years ago

tanzhengtang commented 3 years ago

Hello,I meet a question when I run AA for getting data.

The AmpliconArchitect version is 1.2

my command: python /home/tang/tools/AmpliconArchitect/src/AmpliconArchitect.py --bed ../NCI-H716_AA_CNV_SEEDS.bed --bam ../NCI-H716_hg38_align_sort_rmdup.bam --downsample 0 --out NCI-H716_ecDNA --ref GRCh38

The process: [root:INFO] #TIME 6.673 Loading libraries and reference annotations for: GRCh38 Global ref name is GRCh38 [root:INFO] #TIME 16.012 Initiating bam_to_breakpoint object for: ../NCI-H716_hg38_align_sort_rmdup.bam [root:INFO] #TIME 16.013 Exploring interval: chr10 121465452 121845452 Traceback (most recent call last): File "/home/tang/tools/AmpliconArchitect/src/AmpliconArchitect.py", line 214, in ilist = bamFileb2b.interval_hops(ird, rdlist=all_ilist) File "/home/tang/tools/AmpliconArchitect/src/bam_to_breakpoint.py", line 1675, in interval_hops ii = self.interval_extend(i2) File "/home/tang/tools/AmpliconArchitect/src/bam_to_breakpoint.py", line 1779, in interval_extend elif self.interval_amplified(hg.interval(ic.chrom, ic.end, ic.end + right_size ms_window_size), filter_small=False): File "/home/tang/tools/AmpliconArchitect/src/bam_to_breakpoint.py", line 1746, in interval_amplified elif filter_small == False and i.size() < 2 ms_window_size and len(self.interval_discordant_edges(i)) >= 2: File "/home/tang/tools/AmpliconArchitect/src/bam_to_breakpoint.py", line 1189, in interval_discordant_edges mcdflist.extend(hgl.merge_clusters(extend=self.max_insert - self.read_length)) File "/home/tang/tools/AmpliconArchitect/src/hg19util.py", line 470, in merge_clusters ci = ci.merge(a, extend) File "/home/tang/tools/AmpliconArchitect/src/hg19util.py", line 305, in merge if not self.intersects(y, extend): File "/home/tang/tools/AmpliconArchitect/src/hg19util.py", line 289, in intersects if (int(a[1])-int(b[1]))*(int(a[2])-int(b[1])) <= 0: ValueError: cannot convert float NaN to integer.

What should I do to solve this question? Thanks!

jluebeck commented 3 years ago

Hi,

I have not seen this error before. Could you please upload any log files produced by AA. Could you also clarify which aligner was used to generate the BAM file?

Thanks, Jens

tanzhengtang commented 3 years ago

I used bwa 0.7.17

and all the log content:

INFO:root:Commandline: /home/tang/tools/AmpliconArchitect/src/AmpliconArchitect.py --bed ../NCI-H716_AA_CNV_SEEDS.bed --bam ../NCI-H716_hg38_align_sort_rmdup.bam --downsample 0 --out NCI-H716_ecDNA --ref GRCh38
INFO:root:AmpliconArchitect version 1.2 INFO:root:#TIME 7.035 Loading libraries and reference annotations for: GRCh38 INFO:root:#TIME 16.572 Initiating bam_to_breakpoint object for: ../NCI-H716_hg38_align_sort_rmdup.bam INFO:root:#TIME 16.573 Exploring interval: chr10 121465452 121845452 DEBUG:root:#TIME 16.595 interval_hops: init chr10 121465452 121845452 DEBUG:root:#TIME 52.786 discordant edges chr10 121845452 121855452 DEBUG:root:#TIME 52.943 discordant edges: fetch discordant chr10 121845452 121855452 942 702 DEBUG:root:#TIME 52.943 discordant edges: discordant read pairs found: chr10 121845452 121855452 942 702 INFO:root:Commandline: /home/tang/tools/AmpliconArchitect/src/AmpliconArchitect.py --bed ../NCI-H716_AA_CNV_SEEDS.bed --bam ../NCI-H716_hg38_align_sort.bam --downsample 0 --out NCI-H716_ecDNA --ref GRCh38
INFO:root:AmpliconArchitect version 1.2 INFO:root:#TIME 6.915 Loading libraries and reference annotations for: GRCh38 INFO:root:#TIME 16.398 Initiating bam_to_breakpoint object for: ../NCI-H716_hg38_align_sort.bam INFO:root:#TIME 60.490 Exploring interval: chr10 121465452 121845452 DEBUG:root:#TIME 60.511 interval_hops: init chr10 121465452 121845452 DEBUG:root:#TIME 103.041 discordant edges chr10 121845452 121855452 DEBUG:root:#TIME 103.337 discordant edges: fetch discordant chr10 121845452 121855452 1404 1084 DEBUG:root:#TIME 103.337 discordant edges: discordant read pairs found: chr10 121845452 121855452 1404 1084

jluebeck commented 3 years ago

Hi,

I believe the issue is that somehow NaN is assigned for the mean read_length or max_insert size. Can you please try the following.

  1. Always remove temporary results from the output folder before re-running AA in same folder. It appears AA was run twice in the same folder with different BAM files.
  2. Please clear the contents of the file "coverage.stats" from your AA_DATA_REPO. Do not delete the file, but please clear its contents.
  3. Is the BAM file coordinate sorted?

Best, Jens

tanzhengtang commented 3 years ago

According your advice,I checked my bam file,and found a mistake that I used double R1.fq file when I aligned.

I'm a newbie in this field,I have read a bit of literature in related fields,and I found that the complete sequence of ecDNA is not given in those literature.May I ask question: Is there any way to accurately get the sequence of ecDNA? If yes,how to get the data about complete sequence of ecDNA?

Thank you very much! ZhengTang

jluebeck commented 3 years ago

Hi ZhengTang,

No problem. With NGS data alone, we can only make bioinformatic predictions of ecDNA regions given breakpoint graphs. The confidence of those predictions increases when FISH data or other cytogenetics are also present to validate. Lastly, long-range sequencing can aid in validating the structures suggested by NGS, or in some cases can be used on their own to derive a structure. The task of identifying if ecDNA is present and the problem of resolving its structure are actually quite different. Detecting ecDNA presence from NGS alone is an easier problem than resolving the complete sequence of the ecDNA amplicon.

If you would like to make predictions of ecDNA regions based on your NGS data, please see AmpliconClassifier.

Best, Jens

tanzhengtang commented 3 years ago

I get it. But I’m curious if there is an experimental method to isolate ecDNA for analysis?In this way if we can get the complete structure of ecDNA? Thank you very much!

jluebeck commented 3 years ago

Hi,

Some groups have had success isolating ecDNA using the Circle-Seq technique, for example in the following publications: https://www.nature.com/articles/s41588-019-0547-z https://www.nature.com/articles/s41588-020-0678-2

Best regards, Jens