honzee / RNAseqCNV

R package for large-scale CNV analysis from RNA-seq
MIT License
11 stars 8 forks source link

SNP filtering based on hg38 genome assembly. Non-matching genome assemblies causing missing MAF grapgs. #6

Closed honzee closed 3 years ago

honzee commented 4 years ago

From JAYRJPT:

Dear Jan, Thanks for the update. I am now able to run the package. However I like to inform you that in the final figure lower panel of the graph( MAF frequency) is mising. Is this the issue with my file or with the package? I have sent you the same file. Did you also faced the same issue? I have also got the following warning-

Warning messages: 1: In dir.create(path = chr_dir) : '/media/deepak/Deepak4T/NEW_SAMPLE/SAMPLE_5/RNASEQCNV/RNASEQCNV/output_dir/BALL5' already exists 2: Graphs cannot be vertically aligned unless the axis parameter is set. Placing graphs unaligned. 3: In max(peak_max) : no non-missing arguments to max; returning -Inf 4: Position guide is perpendicular to the intended axis. Did you mean to specify a different guide position?

Regards, Jay

honzee commented 4 years ago

Dear Jay,

I have the same results with your vcf file. I have looked into it a bit.

What RNAseqCNV does is that it filters the detected SNVs based on their presence in dbSNP database at one point. For your file, this excludes the vast majority of SNVs and thus no MAF density plots are drawn.

I think this may be due to different reference genome versions. Do you know which genome reference did you use for generating this vcf file?

Thank you, Jan

JAYRJPT commented 4 years ago

Dear Jan, I have used hg38 for generating the vcf file.

honzee commented 4 years ago

Dear Jay,

I am sorry that I have not clearly stated, that this package is built upon GRch37. Therefore, for optimal analysis, the reference used should be GRch37. That said, there are a few options to go around this.

First, you can download your own dbSNP database version for hg38 and use this as a vector input for RNAseqCNV wrapper function. The format should look like this:

head(keepSNP) [1] "1-11012" "1-13110" "1-13116" "1-13118" "1-13273" "1-14464" RNAseqCNV_wrapper(config = "C:/Users/xaxa/config.txt", metadata = "C:/Users/xaxa/metadata", snv_format = "vcf", keptSNP = keepSNP)

Secondly, I updated the RNAseqCNV again in a way that it accepts argument: keptSNP = FALSE. In this way filtering through dbSNP database is avoided. Unfortunately, the quality of SNVs goes down significantly. To compensate for this, you should probably increase the minimum depth of SNVs that are accepted and lower the accepted MAF range. Use approximately this command after updating RNAseqCNV:

RNAseqCNV_wrapper(config = "C:/Users/xaxa/config.txt", metadata = "C:/Users/xaxa/metadata", snv_format = "vcf", keptSNP = FALSE, mafRange = c(0.25, 0.75), minDepth = 100)

I tried this for your file and it produced results, that could be worked with. However, the accuracy of prediction by RNAseqCNV will be lower and the quality of MAF density graphs will suffer also.

Anyway, looking forward, I am planning to implement also hg38.

Thank you for your patience Jay and I hope that RNAseqCNV will be helpful in some way for you. If you have any other questions, please do not hesitate to ask.

Best, Jan

JAYRJPT commented 4 years ago

Dear Jan, Thanks for your explanation but I didn't understood this-

First, you can download your own dbSNP database version for hg38 and use this as a vector input for RNAseqCNV wrapper function. The format should look like this:

    head(keepSNP)
    [1] "1-11012" "1-13110" "1-13116" "1-13118" "1-13273" "1-14464"
    RNAseqCNV_wrapper(config = "C:/Users/xaxa/config.txt", metadata = "C:/Users/xaxa/metadata", snv_format = "vcf", keptSNP = keepSNP)

Can you explain it further?

Thanks Jay

honzee commented 4 years ago

Yes. The RNAseqCNV_wrapper function has a parameter called keptSNP. This parameter accepts a character vector of SNP identifiers e.g.: "1-11012" "1-13110" "1-13116"..

The SNPs have different identifiers based on the genome version. You can download the hg38 dbSNP database through the instructions here: http://genome.ucsc.edu/FAQ/FAQdownloads.html#snp And subsequently, extract the SNPs into a vector for input for RNAseqCNV_wrapper.

I hope this helps. Best, Jan

JAYRJPT commented 4 years ago

Dear Jan Sorry for a late reply. I will download the hg38 dbSNP database and let you know if I get any issue.

Thanks, Jay

honzee commented 3 years ago

The issue is being solved in: https://github.com/honzee/RNAseqCNV/issues/7#issue-792763374