honzee / RNAseqCNV

R package for large-scale CNV analysis from RNA-seq
MIT License

CNV prediction failed #12

Open Kaddea opened 2 years ago

Kaddea commented 2 years ago

Hi, I've encountered a sample where the CNV prediction throws an error. All other samples were processed fine. I couldn't find any (obvious) problems in the counts or the VCF file. The R output is attached below; maybe it already gives a clue ...

Thanks, Mathias

[1] "Normalization for sample: test_fail completed"
[1] "Preparing file with snv information for: test_fail"
Reading in vcf file..
Extracting depth..
Extracting reference allele and alternative allele depths..
Needed information from vcf extracted
Finished reading vcf
[1] "Estimating chromosome arm CNV: test_fail"
Error in seq.default(from = 1, to = floor(yAxisMax)) :
  'to' must be a finite number
In addition: Warning messages:
1: In dir.create(path = chr_dir) : 'test_output/test_pass' already exists
2: Removed 10 rows containing missing values (geom_smooth).
3: Removed 4 rows containing missing values (geom_smooth).
4: In dir.create(path = chr_dir) : 'test_output/test_fail' already exists
5: In max(.) : no non-missing arguments to max; returning -Inf

honzee commented 2 years ago

Hi Mathias,

Thank you for forwarding the error! It is very likely that there is an issue with this specific VCF file. Specifically, I think there are too few valid SNPs for the analysis on a certain chromosome.

To test this hypothesis, you could list the number of SNPs per chromosome in that VCF file and compare it with the counts from a working VCF file. That would tell us whether there are fewer SNPs than expected.
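The per-chromosome comparison could be done along these lines. This is only a minimal sketch in Python, not part of RNAseqCNV: the function names are made up for illustration, and it simply counts every data record per CHROM value without applying any of the package's filters.

```python
import gzip
from collections import Counter


def count_variants_per_chromosome(vcf_path):
    """Count VCF data records per CHROM value (plain or gzip-compressed file)."""
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    counts = Counter()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            # Skip blank lines, meta lines (##...) and the column header (#CHROM...).
            if not line.strip() or line.startswith("#"):
                continue
            chrom = line.split("\t", 1)[0]
            counts[chrom] += 1
    return counts


def compare_counts(failing, passing):
    """Return (chrom, failing_count, passing_count) rows, in order of first appearance."""
    chroms = list(dict.fromkeys(list(failing) + list(passing)))
    return [(c, failing.get(c, 0), passing.get(c, 0)) for c in chroms]
```

Calling `compare_counts(count_variants_per_chromosome("test_fail.vcf.gz"), count_variants_per_chromosome("test_pass.vcf.gz"))` would give one row per chromosome; the second file name is a placeholder for any sample that ran through without errors.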

You could also forward the VCF file to me if you would feel comfortable with that.

To give you some context: RNAseqCNV has a filter for low SNP coverage across all chromosomes combined, but no equivalent filter for each chromosome separately. If that turns out to be the cause, I will look into it.

Best, Jan

Kaddea commented 2 years ago

Hi Jan,

thanks for the immediate support :)

I've listed the number of variants per chromosome for the "failed" and a "passed" VCF (numbers.csv), but you can also download the more interesting VCF file (test_fail.vcf.gz) from: https://ct1130.pm11-host.1awww.com/s/qmkqWpMco7fmFmn (my personal NextCloudHub).

Both samples were prepared, sequenced and analyzed in the same batch and the quality tests were good for both of them.

If your assumption about too few variants in critical genomic regions is true, recalibrating the gene weights might not help, right? Or does the number of variants used to predict the CNV status depend on the genes used to generate the gene weights?

best, Mathias

variants  test_fail  test_pass
total        387361     356064
1             40137      32418
2             26136      25694
3             23742      19585
4             14834      12848
5             17053      15724
6             19173      17163
7             23887      18911
8             14561      12501
9             15396      13884
10            13597      15442
11            15867      19560
12            20555      20469
13            11157       6642
14            11233      11028
15            12030       9682
16            17153      16447
17            18415      20964
18             6540       6027
19            27966      26011
20            11718      10281
21             6039       4771
22             9177      10161
X             10751       9444
Y               149        393
M                49         36

honzee commented 2 years ago

Hi Mathias,

The SNP counts you showed looked completely fine, so I downloaded the VCF file and looked into it.

The issue is that too many of these SNPs get filtered out. We use multiple SNP filters: sequencing depth, MAF range, and presence in the dbSNP database. Compared with the results from a validated VCF file, the main issue seems to lie in the dbSNP database filter. We use that filter to ensure the reliability of the detected variants.

So first, I would recommend checking whether there is something out of the ordinary about this sample compared to the others. I am not sure what might be causing the significantly lower overlap with the dbSNP database.

Second, I would like to make this part of the analysis more transparent, for example by emitting warning/error messages and by allowing users to set the minimum number of SNPs required to proceed with the analysis.
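To see which filter removes most of the SNPs, one could count the survivors after each stage. The following is a rough Python sketch, not the package's actual code: the `Snv` record, the function name, the default thresholds, and the definition of MAF as alternative-allele depth divided by total depth are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snv:
    chrom: str
    pos: int
    depth: int      # total read depth at the site
    alt_depth: int  # reads supporting the alternative allele


def filter_report(snvs, known_sites, min_depth=20, maf_range=(0.05, 0.9)):
    """Apply depth, MAF and known-site filters in turn.

    Returns (report, kept), where report lists (stage, surviving_count).
    The thresholds are placeholder values, not RNAseqCNV defaults.
    """
    kept = list(snvs)
    report = [("input", len(kept))]

    # Stage 1: minimum sequencing depth (also avoids division by zero below).
    kept = [s for s in kept if s.depth > 0 and s.depth >= min_depth]
    report.append(("depth", len(kept)))

    # Stage 2: allele fraction must fall inside the accepted range.
    low, high = maf_range
    kept = [s for s in kept if low <= s.alt_depth / s.depth <= high]
    report.append(("maf", len(kept)))

    # Stage 3: the site must be present in the set of known (chrom, pos) pairs.
    kept = [s for s in kept if (s.chrom, s.pos) in known_sites]
    report.append(("dbsnp", len(kept)))

    return report, kept
```

A sharp drop between the "maf" and "dbsnp" rows for one sample, but not for a validated one, would point at the database overlap rather than at depth or allele fraction.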
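Such a check could look roughly like the sketch below. It is written in Python for illustration only and does not reflect how RNAseqCNV is or will be implemented; the function name, the `min_snvs` and `strict` parameters, and the default of 100 are hypothetical.

```python
import warnings
from collections import Counter


def check_snv_support(kept_chroms, expected_chroms, min_snvs=100, strict=False):
    """Report chromosomes with fewer than min_snvs SNPs left after filtering.

    kept_chroms: one chromosome label per SNP that survived filtering.
    expected_chroms: chromosomes the analysis needs to cover.
    Returns a dict {chromosome: surviving_count} for the under-supported ones.
    """
    counts = Counter(kept_chroms)
    low = {c: counts[c] for c in expected_chroms if counts[c] < min_snvs}

    if low and strict:
        # Stop early with a clear message instead of failing later in plotting.
        raise ValueError(f"too few SNPs after filtering (minimum {min_snvs}): {low}")

    for chrom, n in low.items():
        warnings.warn(
            f"chromosome {chrom}: only {n} SNPs passed filtering (minimum {min_snvs})"
        )
    return low
```

With `strict=False` the analysis could continue and skip or flag the affected chromosomes; with `strict=True` it stops before any downstream step sees an empty set of values.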

Best, Jan