honzee / RNAseqCNV

R package for large-scale CNV analysis from RNA-seq
MIT License
11 stars 8 forks source link

Hg38 built for RNAseqCNV #7

Closed JAYRJPT closed 3 years ago

JAYRJPT commented 3 years ago

Hello Jan, I was acquiring VCF files by running GATK pipeline for the input in RNAseqCNV analysis. I have found that GATK pipeline no longer supporting GRch37 as reference genome for generating VCF files hence they also removed GRch37 based dbSNP data from there resource bundle. Therefor I would like to know whether RNAseqCNV package is updated based on hg38 built or it is still compatible with hg19 only. Thanks and Regards, Jay

honzee commented 3 years ago

Dear Jay,

I am glad to hear from you again. RNAseqCNV has in-built reference data (including SNPs), which are unfortunately still hg19 only.

That said, you can supply hg38 data and the updated SNPs to the RNAseqCNV_wrapper with: RNAseqCNV_wrapper(config = "path/to/config", metadata = "path/to/metadata", snv_format = "vcf", referData = "new_reference_data", keptSNP = "new_dbSNP_data") To see if the supplied data is in the correct format, you can check it with the in-build reference through: RNAseqCNV:::refDataExp RNAseqCNV:::keepSNP

I am planning to add hg38 reference but this will take a few weeks.

Hope this helps a bit. Jan ---------- Původní e-mail ----------

JAYRJPT commented 3 years ago

Hello Jan, I followed what you have recommended. I have used my own hg38 data into the RNAseq Wrapper. Here is my data in the prescribed format

head(new_reference_data)
  chr start   end            ENSG
1   1 11869 14409 ENSG00000223972
2   1 11869 14409 ENSG00000223972
3   1 11869 12227 ENSG00000223972
4   1 12613 12721 ENSG00000223972
5   1 13221 14409 ENSG00000223972
6   1 12010 13670 ENSG00000223972
head(new_dbSNP_data)
[1] "1-10001" "1-10002" "1-10003" "1-10008" "1-10009" "1-10015"

However, I got some error while running the package. Here is my error log

 RNAseqCNV_wrapper(config = "/media/deepak/jay_doc/contig.txt", metadata = "/media/deepak/jay_doc/META.csv", snv_format = "vcf", referData = "new_reference_data", keptSNP = "new_dbSNP_data")
[1] "Analysis initiated"
[1] "Normalization for sample: B1 completed"
Error in UseMethod("select_") : 
  no applicable method for 'select_' applied to an object of class "character"
In addition: Warning message:
`select_()` is deprecated as of dplyr 0.7.0.
Please use `select()` instead. RNAseqCNV_wrapper(config = "/media/deepak/jay_doc/contig.txt", metadata = "/media/deepak/jay_doc/META.csv", snv_format = "vcf", referData = "new_reference_data", keptSNP = "new_dbSNP_data")
[1] "Analysis initiated"
[1] "Normalization for sample: B1 completed"
Error in UseMethod("select_") : 
  no applicable method for 'select_' applied to an object of class "character"
In addition: Warning message:
`select_()` is deprecated as of dplyr 0.7.0.
Please use `select()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated. 
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated. 

Can you help in resolving the issue?

Thanks and Regards, JAY

honzee commented 3 years ago

Dear Jay,

I am glad you managed to obtain the hg38 data and updated dbSNP in the necessary format. It seems to me, that everything is ok there.

That said, in the function call, there is a wrong input for parameters referData and keptSNP. In the above example the input was: referData = "new_reference_data", keptSNP = "new_dbSNP_data" This means the input is a character string, not the variable storing the hg38 reference and the new dbSNP data. Try to use: RNAseqCNV_wrapper(config = "/media/deepak/jay_doc/contig.txt", metadata = "/media/deepak/jay_doc/META.csv", snv_format = "vcf", referData = new_reference_data, keptSNP = new_dbSNP_data)

If this was not the issue, or something else comes up, please let me know, I hope we can solve it together.

Additionaly, if we get this to work, I would be glad, if we could use the hg38 data and dbSNP database you obtained to update the RNAseqCNV package. Do you think this would be possible?

Best, Jan

JAYRJPT commented 3 years ago

Hello Jan, Thanks for the troubleshooting. I am able to run the package with no noticeable error. But I have got some warnings. Here is the full log-

RNAseqCNV_wrapper(config = "/media/deepak/jay_doc/config.txt", metadata = "/media/deepak/jay_doc/META.csv", snv_format = "vcf", referData = new_reference_data, keptSNP = new_dbSNP_data)

[1] "Analysis initiated" [1] "Normalization for sample: B1 completed" [1] "Preparing file with snv information for: B1" Reading in vcf file.. Extracting depth.. Extracting reference allele and alternative allele depths.. Needed information from vcf extracted Finished reading vcf [1] "Estimating chromosome arm CNV: B1" [1] "Plotting main figure: B1" [1] "Plotting arm-level figures: B1" [1] "Plotting chr 1 arm-level figure" [1] "Plotting chr 2 arm-level figure" [1] "Plotting chr 3 arm-level figure" [1] "Plotting chr 4 arm-level figure" [1] "Plotting chr 5 arm-level figure" [1] "Plotting chr 6 arm-level figure" [1] "Plotting chr 7 arm-level figure" [1] "Plotting chr 8 arm-level figure" [1] "Plotting chr 9 arm-level figure" [1] "Plotting chr 10 arm-level figure" [1] "Plotting chr 11 arm-level figure" [1] "Plotting chr 12 arm-level figure" [1] "Plotting chr 13 arm-level figure" [1] "Plotting chr 14 arm-level figure" [1] "Plotting chr 15 arm-level figure" [1] "Plotting chr 16 arm-level figure" [1] "Plotting chr 17 arm-level figure" [1] "Plotting chr 18 arm-level figure" [1] "Plotting chr 19 arm-level figure" [1] "Plotting chr 20 arm-level figure" [1] "Plotting chr 21 arm-level figure" [1] "Plotting chr 22 arm-level figure" [1] "Plotting chr X arm-level figure" [1] "Analysis for sample: B1 finished"

Warning messages: 1: Computation failed in stat_smooth(): workspace required (3060707942) is too large probably because of setting 'se = TRUE'. 2: Computation failed in stat_smooth(): workspace required (2805011280) is too large probably because of setting 'se = TRUE'. 3: Computation failed in stat_smooth(): workspace required (2427246928) is too large probably because of setting 'se = TRUE'. 4: Computation failed in stat_smooth(): workspace required (3286495030) is too large probably because of setting 'se = TRUE'.

Yes, If it works we should update the package.

honzee commented 3 years ago

It seems like there are some problems with the drawing of the confidence intervals for the chromosome-specific graphs. I tried to solve this in the latest version, you can try to update the package and see if it helps with the warning messages.

Other than that, is the new reference and RNAseqCNV overall working for you?

Jan

JAYRJPT commented 3 years ago

I have attached the main figure generated from the package. I dont know why some of the chromososmes showing low quality and 3 of them showing "ab" instead of some numeric value. It will be appreciated if you can look into the figure and kindly share your opinion.

B1_CNV_main_fig

honzee commented 3 years ago

For sake of clarity, I have moved the issue of figure interpretation into a different thread: https://github.com/honzee/RNAseqCNV/issues/8#issue-844135321

We can, however, continue the discussion on the hg38 build here.

Best, Jan

JAYRJPT commented 3 years ago

It seems like there are some problems with the drawing of the confidence intervals for the chromosome-specific graphs. I tried to solve this in the latest version, you can try to update the package and see if it helps with the warning messages.

Other than that, is the new reference and RNAseqCNV overall working for you?

Jan

So I guess the new reference is working fine. As I have no idea how to update the package as you have mentioned therefore it will be difficult for me to do so. Can I send you the reference file so that you can update the package or is there any other way to do so?

honzee commented 3 years ago

You can update the package by downloading it again by following the instructions in the README for installation.

Yes, that would be very kind of you. Before I can implement it in the package, I need to have a bit more information about the source of the data. Could you please inform me, how the two files were acquired? Thank you, Jan

JAYRJPT commented 3 years ago
  1. GTF file (gencode.v36.annotation.gtf) is used to extract the new_reference_data of Hg38. gencode.v36.annotation.gtf is downloaded from GENCODE (https://www.gencodegenes.org/human)
  2. new_dbSNP_data is obtained from NCBI snp database( https://ftp.ncbi.nih.gov/snp/archive/b153/VCF). GCF_000001405.25.gz is used to extract the reference snp.

Thank You, Jay

honzee commented 3 years ago

Dear Jay,

thank you for the links. However, I am not sure we have a matching gtf file version and dbSNP database version.

  1. The GTF file is genome version: GRCh38.p13

  2. The dbSNP databese is version: GCF_000001405.25, which according to: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.25/ corresponds with GRCh37.p13. The latest version of dbSNP database: GCF_000001405.38 should correspond to GRCh38.p12 (https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.38/).

If I am missing something, or I am incorrect, please, let me know. I am just trying to make sure, that everything checks out.

Best, Jan

JAYRJPT commented 3 years ago

Hello Jan, Sorry for the late response. Yes you are correct GCF_000001405.38 corresponds to GRCh38.p12. It seems I have downloaded the wrong dbSNP database.

Therefore to update the package for hg38 , I should download GCF_000001405.38 and the cooresponding GTF file of the Genome version GRCh38.p12.

Thanks

honzee commented 3 years ago

Hello Jay,

yes, that should be the way.

Best, Jan

honzee commented 3 years ago

Dear Jay,

the hg18 has been implemented into RNAseqCNV. Use the parameter genome_version to choose the desired genome version. The possible options are: "GRCh38" or "hg19".

With this implementation, I will be closing this issue.

Best, Jan