gladkia / igvR

An R Bioconductor package providing interactive connections to igv.js (the Integrative Genomics Viewer) in a web browser
MIT License

smaller vcf sample files #34

Closed paul-shannon closed 1 year ago

paul-shannon commented 1 year ago

@gladkia - Hi Arek,

I just filtered the chr22 AMPAD vcf file, creating chr22-sub, at 0.16% of the original size.
I hope that this 15M file is a good fit for your new hosting at gladkia.pl.

Sample code below. Do you need more from me along these lines? Glad to provide it if so.

   15515834 Oct 12 08:28 chr22-sub.vcf.bgz
        228 Oct 12 08:29 chr22-sub.vcf.bgz.tbi
 9630339397 Jul 16  2021 chr22.vcf.gz
      35986 Jul 16  2021 chr22.vcf.gz.tbi

The full chr22 file, though it covers the smallest of the chromosomes, is so large because it contains many samples. Here is some minimal code to display the subset file:

library(igvR)

igv <- igvR()
setGenome(igv, "hg19")

# remote bgzipped VCF subset and its tabix index
url <- "https://igv-data.systemsbiology.net/ampad/NIA-1898/chr22-sub.vcf.bgz"
indexUrl <- "https://igv-data.systemsbiology.net/ampad/NIA-1898/chr22-sub.vcf.bgz.tbi"
vcf <- list(data=url, index=indexUrl)

# region to display
chrom <- "22"
start.loc <- 50586118
end.loc   <- 50633733
roi <- GRanges(seqnames=chrom, IRanges(start=start.loc, end=end.loc))  # not used by the display calls below
showGenomicRegion(igv, sprintf("%s:%d-%d", chrom, start.loc, end.loc))

# variants are loaded only when the visible window is at most 1 Mb
track <- VariantTrack("chr22-sub", vcf,
                      displayMode="COLLAPSED",
                      visibilityWindow=10^6)
displayTrack(igv, track)

gladkia commented 1 year ago

Hi @paul-shannon,

The subsetting solution looks great :). I will update the chr22 to use chr22-sub :).

I've checked all the files that used to be fetched from http://igv-data.systemsbiology.net. Please see the list below; each row gives the status followed by the URL of the file:

[NOT FOUND] https://igv-data.systemsbiology.net/static/bamtests/x.bam
[COPIED] https://igv-data.systemsbiology.net/static/testFiles/DNase.bam
[COPIED] https://igv-data.systemsbiology.net/static/testFiles/ndufs2-hg38-simple.bed.gz
[COPIED] https://igv-data.systemsbiology.net/static/testFiles/wgEncodeBroadHistoneGm12878H3k4me3StdSig.bigWig
[COPIED] https://igv-data.systemsbiology.net/static/testFiles/ndufs2-hg38-simple2.bed.gz
[COPIED] https://igv-data.systemsbiology.net/testFiles/GRCh38.94.NDUFS2.gff3
[COPIED] https://igv-data.systemsbiology.net/misc/Homo_sapiens.GRCh38.94.chr.gff3.gz
[COPIED] https://igv-data.systemsbiology.net/misc/Homo_sapiens.GRCh38.94.chr.gff3.gz.tbi
[COPIED] https://igv-data.systemsbiology.net/testFiles/gwas/bellenguez.gwas
[COPIED] https://igv-data.systemsbiology.net/testFiles/gwas/bellenguez.bed
[COPIED] https://igv-data.systemsbiology.net/testFiles/gwas/carolin.gwas
[COPIED] https://igv-data.systemsbiology.net/testFiles/gwas/gwas_sample_tiny.tsv
[COPIED] https://igv-data.systemsbiology.net/testFiles/gwas/tbl.gwas.yeast.chrV.tsv
[COPIED] https://igv-data.systemsbiology.net/tair10/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.fai
[COPIED] https://igv-data.systemsbiology.net/tair10/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa
[COPIED] https://igv-data.systemsbiology.net/tair10/TAIR10_genes.sorted.chrLowered.gff3.gz
[COPIED] https://igv-data.systemsbiology.net/Pfalciparum3D7/PlasmoDB-43_Pfalciparum3D7_Genome.fasta
[COPIED] https://igv-data.systemsbiology.net/Pfalciparum3D7/PlasmoDB-43_Pfalciparum3D7_Genome.fasta.fai
[COPIED] https://igv-data.systemsbiology.net/Pfalciparum3D7/PlasmoDB-43_Pfalciparum3D7.gff
[COPIED] http://igv-data.systemsbiology.net/static/rhos/GCF_000012905.2_ASM1290v2_genomic.fna.fai
[COPIED] https://igv-data.systemsbiology.net/static/tmp/chr19sub.bed
[TOO LARGE - 36G] https://igv-data.systemsbiology.net/ampad/NIA-1898/chr7.vcf.gz
[TOO LARGE - 51G] https://igv-data.systemsbiology.net/ampad/NIA-1898/chr2.vcf.gz

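As you can see I've managed to copy most of them. I will need your help with:
• two large VCF files. It would be great to prepare subsets the same way as for chr22
• one file with a broken link: https://igv-data.systemsbiology.net/static/bamtests/x.bam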

Once we solve all these issues we should be able to merge: https://github.com/gladkia/igvR/pull/35.
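
A status list like the one above could be produced with a small script. The following is a minimal sketch in base R, not the procedure actually used in this thread: `probeUrl`, `classifyFile` and `auditUrls` are hypothetical helper names, the 1 GB size limit is an assumed threshold, and the label `COPYABLE` stands in for files that would then be copied.

```r
# Hypothetical helpers for auditing remote files before mirroring them.
# Assumption: the server answers header requests and reports Content-Length.

# Fetch the HTTP status and reported size for one URL (base R, needs libcurl).
probeUrl <- function(url) {
  headers <- tryCatch(curlGetHeaders(url), error = function(e) NULL)
  if (is.null(headers))
    return(list(status = NA_integer_, size.bytes = NA_real_))
  len.lines <- grep("^content-length:", headers, ignore.case = TRUE, value = TRUE)
  size.bytes <- NA_real_
  if (length(len.lines) > 0) {
    # after redirects several header sets may be present; use the last one
    last.line <- len.lines[length(len.lines)]
    size.bytes <- as.numeric(trimws(sub("^[^:]*:", "", last.line)))
  }
  list(status = attr(headers, "status"), size.bytes = size.bytes)
}

# Pure decision step: anything other than HTTP 200 (including a failed
# connection) is reported as NOT FOUND; max.bytes is an assumed limit.
classifyFile <- function(status, size.bytes, max.bytes = 1e9) {
  if (is.na(status) || status != 200L) return("NOT FOUND")
  if (!is.na(size.bytes) && size.bytes > max.bytes) return("TOO LARGE")
  "COPYABLE"
}

# Produce one "[STATUS] url" row per input URL.
auditUrls <- function(urls, max.bytes = 1e9) {
  labels <- vapply(urls, function(url) {
    info <- probeUrl(url)
    classifyFile(info$status, info$size.bytes, max.bytes)
  }, character(1), USE.NAMES = FALSE)
  sprintf("[%s] %s", labels, urls)
}
```

Keeping the decision step (`classifyFile`) separate from the network call makes it possible to check the thresholds without touching the server.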

gladkia commented 1 year ago

@paul-shannon regarding using a subset of the VCF for chr22 (ampad/NIA-1898): it's already done [link].

paul-shannon commented 1 year ago

Hi Arek,

x.bam is now at https://igv-data.systemsbiology.net/bamtests/x.bam

Note that the “static” subdirectory is no longer in the URL.

As for trimming chr7 and chr2

> [TOO LARGE - 36G] https://igv-data.systemsbiology.net/ampad/NIA-1898/chr7.vcf.gz
> [TOO LARGE - 51G] https://igv-data.systemsbiology.net/ampad/NIA-1898/chr2.vcf.gz

I need to know the small region of interest in each file. Is that handy for you to find out?

> As you can see I've managed to copy most of them. I will need your help with:
> • two large VCF files. It would be great to prepare the subset the same way as for chr22
> • one file with not working link: https://igv-data.systemsbiology.net/static/bamtests/x.bam

gladkia commented 1 year ago

Hi Paul,

> Hi Arek, x.bam is now at https://igv-data.systemsbiology.net/bamtests/x.bam

Awesome. Fixed.

> As for trimming chr7 and chr2
> [TOO LARGE - 36G] https://igv-data.systemsbiology.net/ampad/NIA-1898/chr7.vcf.gz
> [TOO LARGE - 51G] https://igv-data.systemsbiology.net/ampad/NIA-1898/chr2.vcf.gz
> I need to know the small region of interest in each file. Is that handy for you to find out?

For chr7 100,330,000-100,340,000 should suffice (https://github.com/gladkia/igvR/blob/master/inst/demos/vcfDemo.R#L17-L21). For chr2 maybe 1,099,000-1,104,000 (https://github.com/gladkia/igvR/blob/master/inst/demos/vcfDemo.R#L45-L49)?

Arek
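
The thread does not say which tool produced the `-sub` files. One plausible way to cut such a subset in R is sketched below, assuming the Bioconductor packages VariantAnnotation, Rsamtools, GenomicRanges and IRanges are installed, that each source file has a tabix index next to it (`.tbi`), that remote access works in the local Rsamtools build, and that chromosomes are named without a `chr` prefix as in the chr22 example. `subsetVcf` and `regions` are hypothetical names, `hg19` is assumed from the chr22 code, and the coordinates are the ones proposed above.

```r
# Regions of interest proposed in this thread (1-based, inclusive).
regions <- list(
  chr7 = list(chrom = "7", start = 100330000L, end = 100340000L),
  chr2 = list(chrom = "2", start = 1099000L,   end = 1104000L)
)

# Read only the variants overlapping `region` from an indexed VCF and write
# them to `dest.file`. Packages are referenced with `::` so that defining the
# function does not require them to be attached.
subsetVcf <- function(source.url, region, dest.file, genome = "hg19") {
  roi <- GenomicRanges::GRanges(
    seqnames = region$chrom,
    ranges = IRanges::IRanges(start = region$start, end = region$end))
  param <- VariantAnnotation::ScanVcfParam(which = roi)
  vcf <- VariantAnnotation::readVcf(Rsamtools::TabixFile(source.url),
                                    genome, param = param)
  # index = TRUE asks writeVcf to bgzip-compress and tabix-index the output
  VariantAnnotation::writeVcf(vcf, dest.file, index = TRUE)
}
```

For example, `subsetVcf("https://igv-data.systemsbiology.net/ampad/NIA-1898/chr7.vcf.gz", regions$chr7, "chr7-sub.vcf")` would request only the chr7 window rather than the whole 36G file.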

paul-shannon commented 1 year ago

@gladkia - Hi Arek,

I think these new smaller files give you what you asked for:

url                                                                      size in bytes
https://igv-data.systemsbiology.net/ampad/NIA-1898/chr2-sub.vcf               11182327
https://igv-data.systemsbiology.net/ampad/NIA-1898/chr2-sub.vcf.bgz            2153591
https://igv-data.systemsbiology.net/ampad/NIA-1898/chr2-sub.vcf.bgz.tbi            110

https://igv-data.systemsbiology.net/ampad/NIA-1898/chr7-sub.vcf               16107528
https://igv-data.systemsbiology.net/ampad/NIA-1898/chr7-sub.vcf.bgz            2551332
https://igv-data.systemsbiology.net/ampad/NIA-1898/chr7-sub.vcf.bgz.tbi            226

gladkia commented 1 year ago

Thanks, @paul-shannon!

I've copied the files and updated the URLs here.
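
To check the new files, the chr22 display code from earlier in the thread can be adapted. This is a sketch only: `remoteVcf` and `showSubset` are hypothetical helpers, the URLs are the ones Paul listed (not the copies on the new hosting, whose addresses are not given here), and `hg19` is assumed to be the right genome build because it was used for chr22-sub.

```r
base.url <- "https://igv-data.systemsbiology.net/ampad/NIA-1898"

# Build the data/index pair that VariantTrack accepts for a remote VCF.
remoteVcf <- function(stem, base = base.url) {
  data.url <- sprintf("%s/%s.vcf.bgz", base, stem)
  list(data = data.url, index = paste0(data.url, ".tbi"))
}

# Zoom to a region and add one variant track; needs a running igvR session.
showSubset <- function(igv, stem, chrom, start.loc, end.loc) {
  igvR::showGenomicRegion(igv, sprintf("%s:%d-%d", chrom, start.loc, end.loc))
  track <- igvR::VariantTrack(stem, remoteVcf(stem),
                              displayMode = "COLLAPSED",
                              visibilityWindow = 10^6)
  igvR::displayTrack(igv, track)
}

# Interactive use (opens a browser tab), with the regions proposed above:
# igv <- igvR::igvR()
# igvR::setGenome(igv, "hg19")
# showSubset(igv, "chr7-sub", "7", 100330000L, 100340000L)
# showSubset(igv, "chr2-sub", "2", 1099000L, 1104000L)
```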