Closed JustGitting closed 7 years ago
Hello, I don't think we're at a point where you would need to re-compile yet. There are a number of steps you've provided, but I don't feel we can confidently determine at what point we've gone awry. Let's see if we can add some sanity checks in here to validate that the steps you've taken result in outputs that seem reasonable. Please try the following.
library(vcfR)
vcf_file <- "big.vcf"
vcf <- read.vcfR(vcf_file, verbose = FALSE)
vcf
The last line (simply the object name) will report some information such as how many samples and variants are in the data. Could you copy and paste this info into a response?
dp <- extract.gt(chrom, element="DP", as.numeric=TRUE)
mids <- apply(dp, MARGIN=2, median, na.rm=TRUE)
mids
Again, the last line should give us some information about the data. This should be a vector the same length as the number of samples in your data. Could you copy and paste this info into a response? If you have a lot of samples you could use dput()
to create some code I can copy and paste locally.
dp2 <- sweep(dp, MARGIN=2, mids, FUN="-")
dp2 <- abs(dp2)
dp2 <- -1 * dp2
amins <- abs(apply(dp2, MARGIN=2, min, na.rm = TRUE))
amins
This is where you report your first issue. What does amins
look like? Can you copy and paste that into a response? Again, if you have a lot of samples you could use dput()
to create something I can copy and paste locally.
Thanks! Brian
Hi Brian,
Thanks for your reply, here are the answers to your questions.
library(vcfR)
vcf_file <- "big.vcf"
vcf <- read.vcfR(vcf_file, verbose = FALSE)
vcf
OUTPUT
> vcf
***** Object of Class vcfR *****
307 samples
4 CHROMs
3,222,628 variants
Object size: 8847.4 Mb
97.79 percent missing data
***** ***** *****
dp <- extract.gt(chrom, element="DP", as.numeric=TRUE)
mids <- apply(dp, MARGIN=2, median, na.rm=TRUE)
dput(mids)
OUTPUT
> dput(mids)
structure(c(4, 4, 2, 5, 9, 8, 6, 8, 2, 2, 5, NA, 8, 4, 9, 5,
2, 3, 2, 3, 6, 2, 11, 5, 2, 2, 2, 4, 3, 5, 9, 2, 2, 2, 2, 2,
5, 3, 7, 2, 4, 8, 4, 2, 9, 11, 2, 3, 5, 2, 2, 2, 2, 5, 12, 4,
2, 2, 2, 2, 6, 13, 2, 10, 2, 5, 2, 7, 6, 2, 5, 2, 14, 6, 7, 2,
2, 5, 7, 2, 12, 4, 7, 2, 7, 7, 2, 11, 7, 2, 6, 10, 2, 2, 2, 2,
9, 10, 3, 5, 8, 6, 2, 8, 5, 7, 2, 12, 8, 2, 2, 3, 7, 2, 2, 4,
2, 7, 11, 6, 2, 2, 2, 2, 5, 2, 10, 2, 6, 8, 2, 5, 2, 4, 2, 5,
7, 6, 9, 4, 2, 5, 6, 2, 2, 8, 3, 2, 3, 9, 5, 13, 14, 10, 6, 10,
8, 6, 3, 6, 3, 7, 8, 8, 9, 6, 4, 5, 2, 4, 10, 7, 5, 3, 9, 8,
2, 3, 2, 5, 4, 7, 3, 2, 6, 2, 2, 2, 8, 9, 6, 2, 2, 2, 2, 2, 6,
2, 6, 2, 6, 11, 5, 4, 6, 7, 4, 5, 2, 7, 6, 8, 4, 11, 2, 5, 8,
4, 2, 6, 3, 4, 2, 7, 5, 8, 2, 2, 2, 8, 5, 9, 6, 3, 4, 2, 9, 6,
4, 6, 4, 2, 8, 9, 3, 3, 2, 9, 2, 6, 2, 4, 8, 2, 2, 6, 5, 8, 4,
5, 8, 2, 4, 13, 10, 10, 8, 2, 7, 8, 6, 8, 5, 7, 8, 10, 13, 3,
2, 5, 9, 6, 4, 10, 4, 6, 9, 4, 3, 8, 2, 4, 2, 7, 3, 2, 6, 5,
4, 7, 11, 2, 10, 6, 8, 2, 2), .Names = c("S206.variant",
"S84A.variant2", "S186.variant3", "S86A.variant4",
"S139.variant5", "S268.variant6", "S205.variant7",
"S215.variant8", "S230.variant9", "S232.variant10",
"S85A.variant11", "S231.variant12",
"S239.variant13", "S248.variant14", "S217.variant15",
"S222.variant16", "S176.variant17", "S40A.variant18",
"S85A.variant19", "S223.variant20", "S76A.variant21",
"S249.variant22", "S214.variant23", "S41A.variant24",
"S178.variant25", "S99A.variant26",
"S214.variant27", "S86A.variant28", "S253.variant29",
"S44A.variant30", "S87A.variant31", "S162.variant32",
"S43A.variant33", "S97A.variant34", "S260.variant35",
"S88A.variant36", "S112.variant37", "S221.variant38",
"S42A.variant39", "S60A.variant40", "S89A.variant41",
"S106.variant42", "S257.variant43", "S212.variant44",
"S143.variant45", "S152.variant46", "S109.variant47",
"S250.variant48", "S110.variant49", "S221.variant50",
"S258.variant51", "S161.variant52", "S92A.variant53",
"S218.variant54", "S96A.variant55", "S177.variant56",
"S70A.variant57", "S94A.variant58", "S142.variant59",
"S211.variant60", "S90A.variant61", "S91A.variant62",
"S262.variant63", "S92A.variant64", "S227.variant65",
"S93A.variant66", "S216.variant67", "S259.variant68",
"S94A.variant69", "S222.variant70", "S95A.variant71",
"S215.variant72", "S226.variant73", "S225.variant74",
"S96A.variant75", "S100.variant76", "S256.variant77",
"S97A.variant78", "S267.variant79", "S146.variant80",
"S98A.variant81", "S150.variant82", "S72A.variant83",
"S244.variant84", "S99A.variant85", "S208.variant86",
"S217.variant87", "ST232.variant88", "S100.variant89",
"S223.variant90", "S48A.variant91", "S101.variant92",
"S243.variant93", "S168.variant94", "S210.variant95",
"S84A.variant96", "S269.variant97", "S102.variant98",
"S207.variant99", "S167.variant100", "S77A.variant101",
"S103.variant102", "S173.variant103", "S104.variant104",
"S138.variant105", "S105.variant106", "S213.variant107",
"S106.variant108", "S107.variant109", "S202.variant110",
"S218.variant111", "S89A.variant112", "S108.variant113",
"S224.variant114", "S263.variant115", "S101.variant116",
"S264.variant117", "S156.variant118", "S212.variant119",
"S109.variant120", "S135.variant121", "S219.variant122",
"S49A.variant123", "S228.variant124", "S110.variant125",
"S196.variant126", "S111.variant127", "S112.variant128",
"S50A.variant129", "S219.variant130", "S182.variant131",
"51A.variant132", "S113.variant133", "S52A.variant134",
"S226.variant135", "S246.variant136",
"S145.variant137", "S144.variant138", "S151.variant139",
"S114.variant140", "S245.variant141", "S115.variant142",
"S116.variant143", "S224.variant144", "S104.variant145",
"S119.variant146", "S171.variant147", "S105.variant148",
"S179.variant149", "S117.variant150", "S118.variant151",
"S220.variant152", "S229.variant153", "S233.variant154",
"S231.variant155", "S78A.variant156", "S149.variant157",
"S114.variant158", "S251.variant159", "S120.variant160",
"S225.variant161", "S121.variant162", "S122.variant163",
"S123.variant164", "S137.variant165", "S124.variant166",
"S241.variant167", "S125.variant168", "S197.variant169",
"S252.variant170", "S126.variant171", "S148.variant172",
"S127.variant173", "S216.variant174", "S155.variant175",
"S128.variant176", "S261.variant177", "S193.variant178",
"S227.variant179", "S88A.variant180", "S187.variant181",
"S80A.variant182", "S54A.variant183",
"S169.variant184", "S55A.variant185", "S93A.variant186",
"S236.variant187", "S56A.variant188", "S210.variant189",
"S265.variant190", "S57A.variant191", "S242.variant192",
"S233.variant193", "S58A.variant194", "S183.variant195",
"S59A.variant196", "S129.variant197", "S240.variant198",
"S130.variant199", "S255.variant200", "S213.variant201",
"S90A.variant202", "S131.variant203", "S136.variant204",
"S74A.variant205", "S132.variant206", "S201.variant207",
"S133.variant208", "S234.variant209", "S165.variant210",
"S195.variant211", "S134.variant212", "S203.variant213",
"S141.variant214", "S185.variant215", "S79A.variant216",
"S150.variant217", "S113.variant218", "S229.variant219",
"S135.variant220", "S209.variant221", "S254.variant222",
"S45A.variant223", "S136.variant224", "S204.variant225",
"S82A.variant226", "S153.variant227",
"S91A.variant228", "S247.variant229",
"S137.variant230", "S211.variant231",
"S75A.variant232", "S228.variant233",
"S103.variant234", "S163.variant235", "S61A.variant236",
"S102.variant237", "S140.variant238", "S192.variant239",
"S81A.variant240", "S138.variant241", "S237.variant242",
"S175.variant243", "S140.variant244", "S62A.variant245",
"S190.variant246", "S47A.variant247", "S139.variant248",
"S220.variant249", "S141.variant250", "S181.variant251",
"S154.variant252", "S142.variant253", "S170.variant254",
"S98A.variant255", "S63A.variant256", "S111.variant257",
"S143.variant258", "S200.variant259", "S160.variant260",
"S83A.variant261", "S188.variant262", "S87A.variant263",
"S166.variant264", "S144.variant265",
"S184.variant266", "S189.variant267", "S64A.variant268",
"S145.variant269", "S159.variant270", "S266.variant271",
"S107.variant272", "S164.variant273", "S146.variant274",
"S230.variant275", "S147.variant276", "S180.variant277",
"S148.variant278", "S65A.variant279",
"S174.variant280", "S73A.variant281",
"S149.variant282", "S108.variant283", "S151.variant284",
"S172.variant285", "S152.variant286", "S153.variant287",
"S154.variant288", "S66A.variant289", "S155.variant290",
"S235.variant291", "S162.variant292", "S67A.variant293",
"S191.variant294", "S156.variant295", "S68A.variant296",
"S157.variant297", "S158.variant298", "S199.variant299",
"S95A.variant300", "S159.variant301", "S238.variant302",
"S194.variant303", "S160.variant304", "S161.variant305",
"S198.variant306", "S69A.variant307"))
dp2 <- sweep(dp, MARGIN=2, mids, FUN="-")
dp2 <- abs(dp2)
dp2 <- -1 * dp2
amins <- abs(apply(dp2, MARGIN=2, min, na.rm = TRUE))
amins
OUTPUT
> dput(amins)
structure(c(1069, 1381, 1054, 3416, 4928, 3342, 2697, 3963, 1,
6, 1775, Inf, 3466, 1685, 5627, 2309, 450, 1739, 79, 1329, 2962,
524, 6301, 1505, 464, 73, 166, 1775, 907, 3867, 4404, 153, 2032,
59, 801, 1306, 1890, 409, 5064, 37, 2193, 5186, 1837, 282, 4107,
5730, 0, 837, 2284, 154, 713, 430, 67, 2183, 5460, 1931, 578,
11, 510, 270, 2951, 6988, 357, 3856, 496, 2575, 406, 3139, 4861,
191, 3225, 1222, 6587, 2589, 5700, 54, 410, 3344, 4143, 856,
6955, 1562, 5354, 277, 2048, 3231, 87, 4959, 3577, 173, 2890,
5688, 135, 193, 3301, 4, 3964, 6647, 532, 6079, 5387, 1800, 898,
4295, 3275, 4645, 1244, 6314, 5688, 792, 144, 1722, 4414, 174,
207, 1524, 504, 4458, 6208, 2005, 158, 590, 3156, 0, 2259, 90,
6339, 1, 3085, 4062, 337, 1843, 11, 1000, 0, 1874, 5273, 4112,
4303, 1118, 444, 1887, 3942, 740, 102, 4184, 1075, 11, 784, 5032,
2083, 6533, 6759, 5066, 3332, 6053, 3495, 2405, 913, 2797, 532,
3833, 4173, 4071, 4159, 3654, 1217, 2397, 547, 1952, 5504, 5141,
2083, 1724, 4591, 4223, 177, 1107, 1, 2510, 1706, 3715, 1468,
364, 3447, 227, 80, 121, 5017, 5313, 3248, 522, 2188, 107, 92,
232, 2906, 223, 4092, 589, 1793, 5276, 1747, 2341, 6085, 3250,
2392, 3527, 280, 3608, 4375, 3653, 703, 6793, 999, 3335, 7029,
1396, 290, 3642, 937, 1240, 537, 4011, 2297, 5304, 112, 18, 594,
4600, 1970, 4744, 3001, 1070, 3344, 450, 4422, 3687, 1716, 4026,
2002, 151, 4894, 2584, 363, 1551, 135, 6603, 198, 3737, 365,
2642, 4176, 495, 14, 2781, 3043, 4304, 3059, 3191, 4999, 787,
2299, 5532, 6964, 5577, 6136, 149, 3000, 5337, 3670, 4684, 3377,
4070, 3666, 7220, 5657, 636, 66, 2343, 5318, 2958, 2561, 5966,
2093, 3291, 3632, 1366, 1141, 4655, 418, 2417, 80, 4769, 1160,
17, 2927, 2566, 2609, 5100, 6956, 456, 6016, 2356, 3823, 34,
688), .Names = c("S206.variant", "S84A.variant2",
"S186.variant3", "S86A.variant4", "S139.variant5",
"S268.variant6", "S205.variant7", "S215.variant8",
"S230.variant9", "S232.variant10",
"S85A.variant11", "S231.variant12",
"S239.variant13", "S248.variant14", "S217.variant15",
"S222.variant16", "S176.variant17", "S40A.variant18",
"S85A.variant19", "S223.variant20", "S76A.variant21",
"S249.variant22", "S214.variant23", "S41A.variant24",
"S178.variant25", "S99A.variant26",
"S214.variant27", "S86A.variant28", "S253.variant29",
"S44A.variant30", "S87A.variant31", "S162.variant32",
"S43A.variant33", "S97A.variant34", "S260.variant35",
"S88A.variant36", "S112.variant37", "S221.variant38",
"S42A.variant39", "S60A.variant40", "S89A.variant41",
"S106.variant42", "S257.variant43", "S212.variant44",
"S143.variant45", "S152.variant46", "S109.variant47",
"S250.variant48", "S110.variant49", "S221.variant50",
"S258.variant51", "S161.variant52", "S92A.variant53",
"S218.variant54", "S96A.variant55", "S177.variant56",
"S70A.variant57", "S94A.variant58", "S142.variant59",
"S211.variant60", "S90A.variant61", "S91A.variant62",
"S262.variant63", "S92A.variant64", "S227.variant65",
"S93A.variant66", "S216.variant67", "S259.variant68",
"S94A.variant69", "S222.variant70", "S95A.variant71",
"S215.variant72", "S226.variant73", "S225.variant74",
"S96A.variant75", "S100.variant76", "S256.variant77",
"S97A.variant78", "S267.variant79", "S146.variant80",
"S98A.variant81", "S150.variant82", "S72A.variant83",
"S244.variant84", "S99A.variant85", "S208.variant86",
"S217.variant87", "S232.variant88", "S100.variant89",
"S223.variant90", "S48A.variant91", "S101.variant92",
"S243.variant93", "S168.variant94", "S210.variant95",
"S84A.variant96", "S269.variant97", "S102.variant98",
"S207.variant99", "S167.variant100", "S77A.variant101",
"S103.variant102", "S173.variant103", "S104.variant104",
"S138.variant105", "S105.variant106", "S213.variant107",
"S106.variant108", "S107.variant109", "S202.variant110",
"S218.variant111", "S89A.variant112", "S108.variant113",
"S224.variant114", "S263.variant115", "S101.variant116",
"S264.variant117", "S156.variant118", "S212.variant119",
"S109.variant120", "S135.variant121", "S219.variant122",
"S49A.variant123", "S228.variant124", "S110.variant125",
"S196.variant126", "S111.variant127", "S112.variant128",
"S50A.variant129", "S219.variant130", "S182.variant131",
"S51A.variant132", "S113.variant133", "S52A.variant134",
"S226.variant135", "S246.variant136",
"S145.variant137", "S144.variant138", "S151.variant139",
"S114.variant140", "S245.variant141", "S115.variant142",
"S116.variant143", "S224.variant144", "S104.variant145",
"S119.variant146", "S171.variant147", "S105.variant148",
"S179.variant149", "S117.variant150", "S118.variant151",
"S220.variant152", "S229.variant153", "S233.variant154",
"S231.variant155", "S78A.variant156", "S149.variant157",
"S114.variant158", "S251.variant159", "S120.variant160",
"S225.variant161", "S121.variant162", "S122.variant163",
"S123.variant164", "S137.variant165", "S124.variant166",
"S241.variant167", "S125.variant168", "S197.variant169",
"S252.variant170", "S126.variant171", "S148.variant172",
"S127.variant173", "S216.variant174", "S155.variant175",
"S128.variant176", "S261.variant177", "S193.variant178",
"S227.variant179", "S88A.variant180", "S187.variant181",
"S80A.variant182", "S54A.variant183",
"S169.variant184", "S55A.variant185", "S93A.variant186",
"S236.variant187", "S56A.variant188", "S210.variant189",
"S265.variant190", "S57A.variant191", "S242.variant192",
"S233.variant193", "S58A.variant194", "S183.variant195",
"S59A.variant196", "S129.variant197", "S240.variant198",
"S130.variant199", "S255.variant200", "S213.variant201",
"S90A.variant202", "S131.variant203", "S136.variant204",
"S74A.variant205", "S132.variant206", "S201.variant207",
"S133.variant208", "S234.variant209", "S165.variant210",
"S195.variant211", "S134.variant212", "S203.variant213",
"S141.variant214", "S185.variant215", "S79A.variant216",
"S150.variant217", "S113.variant218", "S229.variant219",
"S135.variant220", "S209.variant221", "S254.variant222",
"S45A.variant223", "S136.variant224", "S204.variant225",
"S82A.variant226", "S153.variant227",
"S91A.variant228", "S247.variant229",
"S137.variant230", "S211.variant231",
"S75A.variant232", "S228.variant233",
"S103.variant234", "S163.variant235", "S61A.variant236",
"S102.variant237", "S140.variant238", "S192.variant239",
"S81A.variant240", "S138.variant241", "S237.variant242",
"S175.variant243", "S140.variant244", "S62A.variant245",
"S190.variant246", "S47A.variant247", "S139.variant248",
"S220.variant249", "S141.variant250", "S181.variant251",
"S154.variant252", "S142.variant253", "S170.variant254",
"S98A.variant255", "S63A.variant256", "S111.variant257",
"S143.variant258", "S200.variant259", "S160.variant260",
"S83A.variant261", "S188.variant262", "S87A.variant263",
"S166.variant264", "S144.variant265",
"S184.variant266", "S189.variant267", "S64A.variant268",
"S145.variant269", "S159.variant270", "S266.variant271",
"S107.variant272", "S164.variant273", "S146.variant274",
"S230.variant275", "S147.variant276", "S180.variant277",
"S148.variant278", "S65A.variant279",
"S174.variant280", "S73A.variant281",
"S149.variant282", "S108.variant283", "S151.variant284",
"S172.variant285", "S152.variant286", "S153.variant287",
"S154.variant288", "S66A.variant289", "S155.variant290",
"S235.variant291", "S162.variant292", "S67A.variant293",
"S191.variant294", "S156.variant295", "S68A.variant296",
"S157.variant297", "S158.variant298", "S199.variant299",
"S95A.variant300", "S159.variant301", "S238.variant302",
"S194.variant303", "S160.variant304", "S161.variant305",
"S198.variant306", "S69A.variant307"))
I hope this helps.
Thanks for getting back to me with the output I requested. I see several issues.
The show
output reports that you have 307 samples and 3,222,628 variants. However, your data matrix appears to be 98% missing data. I do not know what your specific application is, but if this were my data I would be concerned. Is it possible that this was the result of a low quality sequencing run? Or possibly you aimed at too low of coverage per individual? Because this does appear to be the issue you've encountered. You can explore the distribution of sequence depth in this post to get an idea of what sort of sequence depth you have throughout your data. In fact, I would recommend you look through all the posts in my GBS class because it should help you identify samples and/or variants with a high degree of missingness and omit them. (Note that these pages were prepared in the context of GBS, but they are just as relevant for working with any VCF data.)
The problem you've encountered appears to be due to samples that are entirely missing data. Your sample 12 appears to be an example. We do not need a large data set to illustrate this. You can find suggestions on how to post an issue at this post.
library(vcfR)
data(vcfR_test)
vcfR_test
dp <- extract.gt(vcfR_test, element="DP", as.numeric=TRUE)
# Set one column of dp to NA
is.na(dp[,3]) <- TRUE
dp
amins <- abs(apply(dp, MARGIN=2, min, na.rm = TRUE))
Produces the error.
Warning message:
In FUN(newX[, i], ...) : no non-missing arguments to min; returning Inf
When we explore the data we see the following.
amins
NA00001 NA00002 NA00003
1 0 Inf
This reproduces your reported issue. Its a logical error, you can't take a minimum if the column consists completely on NA. At this point this issue is propagated throughout the rest of your code. Some R functions handle it better than others. But by the time you try to rank your variants you will have variants that have no criteria to rank them with.
I think you need to back up and familiarize yourself with what is in your dataset and try to clean it up some before you attempt to analyze it.
Good luck! Brian
Hi Brian,
Thanks for your help. I am indeed working with someone else's data and helping to QC it. I've flagged that there is 98% data is missing that can cause problems. I'm still learning as well :)
Is it possible to add a better error message for these situations to rank.variants.chromR(chrom, scores)
? Maybe indicating the sample name(s) with invalid scores.
The chromqc() plots of the VCF file look "okay" I'd guess. However, because of the size of the chromosome, the plots won't show individual samples that are junk.
I've plotted the sequence depth using the following code:
dp2 <- dp
dp2[ dp2 == 0 ] <- NA
boxplot(dp2, las=2, col=2:5, main="Sequence Depth (DP)", ylab="Depth (log)", log="y", xaxt="n")
abline(h=10^c(0:4), lty=3, col="#808080")
axis(1, cex.axis=0.3, las=2, at=c(1:ncol(dp2)), labels=colnames(dp2))
Thanks again!
I like to think that one of the main points of vcfR is to familiarize yourself with what your data looks like. I always advocate using show()
to get that first peek. There's also queryMETA()
to help you get an idea of what data your specific VCF file has.
I can see value in trying to catch unexpected behaviour and provide a more meaningful error message. But we haven't actually reproduced an error involving a vcfR function yet. I hope you're learning to identify some of these issues that are outside of vcfR. In order to catch an error you'll need to help me reproduce the error locally. You report that
chrom <- rank.variants.chromR(chrom, scores)
throws the following error.
Error in rank.variants.chromR(chrom, scores) : index out of bounds
Calls: rank.variants.chromR -> .Call
Execution halted
This suggests that there is either something wrong with your chromR object or your vector named 'scores'. I'm going to guess that its the vector named 'scores'. Could you recreate the error on your machine to validate you've recreated the error and then post the results of:
dput(scores)
This should give me an idea of things I can try on my end to reproduce the error. If I can reproduce the error then I can engineer a way to catch it in my code and try to provide a more meaningful error message.
Thanks! Brian
Hi Brian,
Thanks for your help. Yes, I agree 100%, the need to understand the quirks of ones data....which I'm working on :)
The scores
is a sizable:
> length(scores)
[1] 3222628
That is at least the same number of variants reported by the vcf
object. Because of scores
size, I"ve saved the dput(scores)
to a file.
dput(scores, files = "scores.txt")
...that yields a 149MB file... Hence, it is a little too big to post here. So, I checked for inf/null/na entries that might catch something...
$ grep -i -e "null\|inf\|na" scores.txt
1.01867021276596, 1.01852289512555), .Names = c("ST4.03ch01_26006_1",
So it only catches ".Names".
The summary()
of scores looks okay too.
> summary(scores)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 1.019 1.030 7.649 3.068 476.000
Thanks
Oh yeah, I should have realized the scores vector would be large. Sorry about that. But I'm afraid I'm at a loss here. If I can't reproduce this issue on my side then its pretty difficult for me to troubleshoot it. We either need a way for you to modify one of my existing examples so that it reproduces this behaviour or we need to modify your data in a manner where you can share it with me. You could try replacing:
chrom <- create.chromR(name="Supercontig", vcf=vcf, seq=dna, ann=gff, verbose=FALSE)
for:
chrom <- create.chromR(name="Supercontig", vcf=vcf[1:100,], seq=dna, ann=gff, verbose=FALSE)
This will subset your VCF data to the first 100 variants. You could try different intervals until you find one that reproduces the error. (You may want to start with larger intervals.) By subsetting the data in this way you may be able to isolate a small number of variants that are causing your trouble. Alternatively, if this doesn't work it may indicate that the trouble is not the VCF data and perhaps the sequence or annotation.
Other sources of information that may help me would be using functions that take a peak at your data. From the data in the ranking data
example we can use:
head(chrom)
show(dna)
head(gff)
That might help us. But again, without a reproducible example there isn't much I can do.
Thanks! Brian
Hi Brian,
Whilst I subset the VCF to see if I can pinpoint the error. Here is the info about chrom, dna and gff.
> head(chrom)
***** Class chromR, method head *****
Name: Potato Contig
Length: 88,663,952
***** Sample names (chromR) *****
[1] "S206.variant"
[2] "S84A.variant2"
[3] "S186.variant3"
[4] "S86A.variant4"
[5] "S139.variant5"
[6] "S268.variant6"
[7] "S205.variant7"
[8] "S215.variant8"
[9] "S230.variant9"
[10] "S232.variant10"
[11] "S85A.variant11"
[12] "S231.variant12"
[13] "S239.variant13"
[14] "S248.variant14"
[15] "S217.variant15"
[16] "S222.variant16"
[17] "S176.variant17"
[18] "S40A.variant18"
[19] "S85A.variant19"
[20] "S223.variant20"
[21] "S76A.variant21"
[22] "S249.variant22"
[23] "S214.variant23"
[24] "S41A.variant24"
[25] "S178.variant25"
[26] "S99A.variant26"
[27] "S214.variant27"
[28] "S86A.variant28"
[29] "S253.variant29"
[30] "S44A.variant30"
[31] "S87A.variant31"
[32] "S162.variant32"
[33] "S43A.variant33"
[34] "S97A.variant34"
[35] "S260.variant35"
[36] "S88A.variant36"
[37] "S112.variant37"
[38] "S221.variant38"
[39] "S42A.variant39"
[40] "S60A.variant40"
[41] "S89A.variant41"
[42] "S106.variant42"
[43] "S257.variant43"
[44] "S212.variant44"
[45] "S143.variant45"
[46] "S152.variant46"
[47] "S109.variant47"
[48] "S250.variant48"
[49] "S110.variant49"
[50] "S221.variant50"
[51] "S258.variant51"
[52] "S161.variant52"
[53] "S92A.variant53"
[54] "S218.variant54"
[55] "S96A.variant55"
[56] "S177.variant56"
[57] "S70A.variant57"
[58] "S94A.variant58"
[59] "S142.variant59"
[60] "S211.variant60"
[61] "S90A.variant61"
[62] "S91A.variant62"
[63] "S262.variant63"
[64] "S92A.variant64"
[65] "S227.variant65"
[66] "S93A.variant66"
[67] "S216.variant67"
[68] "S259.variant68"
[69] "S94A.variant69"
[70] "S222.variant70"
[71] "S95A.variant71"
[72] "S215.variant72"
[73] "S226.variant73"
[74] "S225.variant74"
[75] "S96A.variant75"
[76] "S100.variant76"
[77] "S256.variant77"
[78] "S97A.variant78"
[79] "S267.variant79"
[80] "S146.variant80"
[81] "S98A.variant81"
[82] "S150.variant82"
[83] "S72A.variant83"
[84] "S244.variant84"
[85] "S99A.variant85"
[86] "S208.variant86"
[87] "S217.variant87"
[88] "S232.variant88"
[89] "S100.variant89"
[90] "S223.variant90"
[91] "S48A.variant91"
[92] "S101.variant92"
[93] "S243.variant93"
[94] "S168.variant94"
[95] "S210.variant95"
[96] "S84A.variant96"
[97] "S269.variant97"
[98] "S102.variant98"
[99] "S207.variant99"
[100] "S167.variant100"
[101] "S77A.variant101"
[102] "S103.variant102"
[103] "S173.variant103"
[104] "S104.variant104"
[105] "S138.variant105"
[106] "S105.variant106"
[107] "S213.variant107"
[108] "S106.variant108"
[109] "S107.variant109"
[110] "S202.variant110"
[111] "S218.variant111"
[112] "S89A.variant112"
[113] "S108.variant113"
[114] "S224.variant114"
[115] "S263.variant115"
[116] "S101.variant116"
[117] "S264.variant117"
[118] "S156.variant118"
[119] "S212.variant119"
[120] "S109.variant120"
[121] "S135.variant121"
[122] "S219.variant122"
[123] "S49A.variant123"
[124] "S228.variant124"
[125] "S110.variant125"
[126] "S196.variant126"
[127] "S111.variant127"
[128] "S112.variant128"
[129] "S50A.variant129"
[130] "S219.variant130"
[131] "S182.variant131"
[132] "S51A.variant132"
[133] "S113.variant133"
[134] "S52A.variant134"
[135] "S226.variant135"
[136] "S246.variant136"
[137] "S145.variant137"
[138] "S144.variant138"
[139] "S151.variant139"
[140] "S114.variant140"
[141] "S245.variant141"
[142] "S115.variant142"
[143] "S116.variant143"
[144] "S224.variant144"
[145] "S104.variant145"
[146] "S119.variant146"
[147] "S171.variant147"
[148] "S105.variant148"
[149] "S179.variant149"
[150] "S117.variant150"
[151] "S118.variant151"
[152] "S220.variant152"
[153] "S229.variant153"
[154] "S233.variant154"
[155] "S231.variant155"
[156] "S78A.variant156"
[157] "S149.variant157"
[158] "S114.variant158"
[159] "S251.variant159"
[160] "S120.variant160"
[161] "S225.variant161"
[162] "S121.variant162"
[163] "S122.variant163"
[164] "S123.variant164"
[165] "S137.variant165"
[166] "S124.variant166"
[167] "S241.variant167"
[168] "S125.variant168"
[169] "S197.variant169"
[170] "S252.variant170"
[171] "S126.variant171"
[172] "S148.variant172"
[173] "S127.variant173"
[174] "S216.variant174"
[175] "S155.variant175"
[176] "S128.variant176"
[177] "S261.variant177"
[178] "S193.variant178"
[179] "S227.variant179"
[180] "S88A.variant180"
[181] "S187.variant181"
[182] "S80A.variant182"
[183] "S54A.variant183"
[184] "S169.variant184"
[185] "S55A.variant185"
[186] "S93A.variant186"
[187] "S236.variant187"
[188] "S56A.variant188"
[189] "S210.variant189"
[190] "S265.variant190"
[191] "S57A.variant191"
[192] "S242.variant192"
[193] "S233.variant193"
[194] "S58A.variant194"
[195] "S183.variant195"
[196] "S59A.variant196"
[197] "S129.variant197"
[198] "S240.variant198"
[199] "S130.variant199"
[200] "S255.variant200"
[201] "S213.variant201"
[202] "S90A.variant202"
[203] "S131.variant203"
[204] "S136.variant204"
[205] "S74A.variant205"
[206] "S132.variant206"
[207] "S201.variant207"
[208] "S133.variant208"
[209] "S234.variant209"
[210] "S165.variant210"
[211] "S195.variant211"
[212] "S134.variant212"
[213] "S203.variant213"
[214] "S141.variant214"
[215] "S185.variant215"
[216] "S79A.variant216"
[217] "S150.variant217"
[218] "S113.variant218"
[219] "S229.variant219"
[220] "S135.variant220"
[221] "S209.variant221"
[222] "S254.variant222"
[223] "S45A.variant223"
[224] "S136.variant224"
[225] "S204.variant225"
[226] "S82A.variant226"
[227] "S153.variant227"
[228] "S91A.variant228"
[229] "S247.variant229"
[230] "S137.variant230"
[231] "S211.variant231"
[232] "S75A.variant232"
[233] "S228.variant233"
[234] "S103.variant234"
[235] "S163.variant235"
[236] "S61A.variant236"
[237] "S102.variant237"
[238] "S140.variant238"
[239] "S192.variant239"
[240] "S81A.variant240"
[241] "S138.variant241"
[242] "S237.variant242"
[243] "S175.variant243"
[244] "S140.variant244"
[245] "S62A.variant245"
[246] "S190.variant246"
[247] "S47A.variant247"
[248] "S139.variant248"
[249] "S220.variant249"
[250] "S141.variant250"
[251] "S181.variant251"
[252] "S154.variant252"
[253] "S142.variant253"
[254] "S170.variant254"
[255] "S98A.variant255"
[256] "S63A.variant256"
[257] "S111.variant257"
[258] "S143.variant258"
[259] "S200.variant259"
[260] "S160.variant260"
[261] "S83A.variant261"
[262] "S188.variant262"
[263] "S87A.variant263"
[264] "S166.variant264"
[265] "S144.variant265"
[266] "S184.variant266"
[267] "S189.variant267"
[268] "S64A.variant268"
[269] "S145.variant269"
[270] "S159.variant270"
[271] "S266.variant271"
[272] "S107.variant272"
[273] "S164.variant273"
[274] "S146.variant274"
[275] "S230.variant275"
[276] "S147.variant276"
[277] "S180.variant277"
[278] "S148.variant278"
[279] "S65A.variant279"
[280] "S174.variant280"
[281] "S73A.variant281"
[282] "S149.variant282"
[283] "S108.variant283"
[284] "S151.variant284"
[285] "S172.variant285"
[286] "S152.variant286"
[287] "S153.variant287"
[288] "S154.variant288"
[289] "S66A.variant289"
[290] "S155.variant290"
[291] "S235.variant291"
[292] "S162.variant292"
[293] "S67A.variant293"
[294] "S191.variant294"
[295] "S156.variant295"
[296] "S68A.variant296"
[297] "S157.variant297"
[298] "S158.variant298"
[299] "S199.variant299"
[300] "S95A.variant300"
[301] "S159.variant301"
[302] "S238.variant302"
[303] "S194.variant303"
[304] "S160.variant304"
[305] "S161.variant305"
[306] "S198.variant306"
[307] "S69A.variant307"
***** VCF fixed data (chromR) *****
CHROM POS ID REF ALT QUAL FILTER
[1,] "ST4.03ch01" "26006" NA "C" "T" "66.09" "PASS"
[2,] "ST4.03ch01" "26014" NA "A" "T" "66.09" "PASS"
[3,] "ST4.03ch01" "133102" NA "G" "A" "25.10" "SnpCluster"
[4,] "ST4.03ch01" "133103" NA "G" "A" "25.10" "SnpCluster"
[5,] "ST4.03ch01" "133123" NA "G" "A" "66.09" "SnpCluster"
[6,] "ST4.03ch01" "133139" NA "C" "A" "66.09" "SnpCluster"
INFO column has been suppressed, first INFO record:
[1] "AC=4" "AF=1.00" "AN=4" "DP=2"
[5] "FS=0.000" "MLEAC=4" "MLEAF=1.00" "MQ=60.00"
[9] "QD=33.04" "SOR=0.693" "set=variant107"
***** VCF genotype data (chromR) *****
***** First 6 columns *********
FORMAT S206.variant S84A.variant2
[1,] "GT:AD:DP:GQ:PL" NA NA
[2,] "GT:AD:DP:GQ:PL" NA NA
[3,] "GT:AD:DP:GQ:PL" NA NA
[4,] "GT:AD:DP:GQ:PL" NA NA
[5,] "GT:AD:DP:GQ:PL" NA NA
[6,] "GT:AD:DP:GQ:PL" NA NA
S186.variant3 S86A.variant4 S139.variant5
[1,] NA NA NA
[2,] NA NA NA
[3,] NA NA NA
[4,] NA NA NA
[5,] NA NA NA
[6,] NA NA NA
***** Var info (chromR) *****
***** First 6 columns *****
CHROM POS MQ DP mask n
1 ST4.03ch01 26006 60 2 TRUE 1
2 ST4.03ch01 26014 60 2 TRUE 1
3 ST4.03ch01 133102 60 2 TRUE 1
4 ST4.03ch01 133103 60 2 TRUE 1
5 ST4.03ch01 133123 60 2 TRUE 1
6 ST4.03ch01 133139 60 4 TRUE 2
***** VCF mask (chromR) *****
Percent unmasked: 100
***** End head (chromR) *****
> show(dna)
12 DNA sequences in binary format stored in a list.
Mean sequence length: 60418115
Shortest sequence: 45475667
Longest sequence: 88663952
Labels:
ST4.03ch01
ST4.03ch02
ST4.03ch03
ST4.03ch04
ST4.03ch05
ST4.03ch06
...
More than 10 million nucleotides: not printing base composition
> head(gff)
V1 V2 V3 V4 V5 V6 V7 V8
1 ST4.03ch01 BGI gene 152322 153489 . . .
2 ST4.03ch01 Cufflinks mRNA 152322 153489 . - .
3 ST4.03ch01 Cufflinks exon 153389 153489 . - .
4 ST4.03ch01 Cufflinks exon 152322 152593 . - .
5 ST4.03ch01 Cufflinks intron 152594 153388 . - .
6 ST4.03ch01 BestORF CDS 152418 152576 . - 0
V9
1 ID=PGSC0003DMG400015133;name="Defensin"
2 ID=PGSC0003DMT400039136;Parent=PGSC0003DMG400015133;Source_id=RNASEQ26.809.0;Mapping_depth=16.192011;Class=4;name="Defensin"
3 ID=PGSC0003DME400103709;Parent=PGSC0003DMT400039136
4 ID=PGSC0003DME400103710;Parent=PGSC0003DMT400039136
5 ID=PGSC0003DMI400065839;Parent=PGSC0003DMT400039136
6 ID=PGSC0003DMC400026563;Parent=PGSC0003DMT400039136;name="Defensin"
Well, it seems we could handle the samples names a little more elegantly in the chromR head method.
Modified chromR head to summarize samples names at 26a4ee8564898909aa1fb67e0f3c470c75d262d6.
Other than that, nothing jumps out at me as being out of place. You may want to double check that you're working with only one chromosome:
https://knausb.github.io/vcfR_documentation/subset_data_to_1chrom.html
Thanks!
Hi Brian,
Sorry for the delay, had to working on other stuff.
I've tried to subset VCF, but there appears to be another problem that is probabily related. The memory use of R goes through the roof when running the command:
chrom <- rank.variants.chromR(chrom, scores)
Memory use of R goes up to 600GB according to top. :(
To help identify the memory problem, I've used the R function defined from a topic on stackoverflow:
https://stackoverflow.com/questions/1358003/tricks-to-manage-the-available-memory-in-an-r-session
I've reproduced it here for reference:
showMemoryUse <- function(sort="size", decreasing=FALSE, limit) {
objectList <- ls(parent.frame())
oneKB <- 1024
oneMB <- 1048576
oneGB <- 1073741824
memoryUse <- sapply(objectList, function(x) as.numeric(object.size(eval(parse(text=x)))))
memListing <- sapply(memoryUse, function(size) {
if (size >= oneGB) return(paste(round(size/oneGB,2), "GB"))
else if (size >= oneMB) return(paste(round(size/oneMB,2), "MB"))
else if (size >= oneKB) return(paste(round(size/oneKB,2), "kB"))
else return(paste(size, "bytes"))
})
memListing <- data.frame(objectName=names(memListing),memorySize=memListing,row.names=NULL)
if (sort=="alphabetical") memListing <- memListing[order(memListing$objectName,decreasing=decreasing),]
else memListing <- memListing[order(memoryUse,decreasing=decreasing),] #will run if sort not specified or "size"
if(!missing(limit)) memListing <- memListing[1:limit,]
print(memListing, row.names=FALSE)
return(invisible(memListing))
}
As a precaution, I performed garbage collection in the running R session.
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 14421920 770.3 20885653 1115.5 14540262 776.6
Vcells 1614503174 12317.7 6631073171 50591.1 7982538277 60902.0
The showMemoryUse() function shows that the workspace looks reasonable for the size of the files used.
> showMemoryUse(decreasing=TRUE, limit=10)
objectName memorySize
vcf 8.64 GB
chrom 3.21 GB
dna 691.43 MB
scores 270.45 MB
gff 78.21 MB
showMemoryUse 32.55 kB
amins 26.68 kB
mids 26.68 kB
filename_dna 136 bytes
filename_gff 120 bytes
Whilst, according to the top command, the R session is using a sizable ~30GB of memory.
$ top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19033 user1 20 0 30.4g 30g 5316 T 0.0 2.0 15:54.43 R
Let's run the rank.variants.chromR() function (using the full vcf data structure) and see what happens...
> chrom <- rank.variants.chromR(chrom, scores)
Error in rank.variants.chromR(chrom, scores) : index out of bounds
The error is as before, but the memory use of R whilst performing this opertion is crazy. According to top, it's allocated 600GB...
$ top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19033 user1 20 0 615g 606g 6228 S 0.0 40.0 105:34.38 R
The R workspace memory usage is still about the same as before calling the command.
> showMemoryUse(decreasing=TRUE, limit=10)
objectName memorySize
vcf 8.64 GB
chrom 3.21 GB
dna 691.43 MB
scores 270.45 MB
gff 78.21 MB
showMemoryUse 32.55 kB
amins 26.68 kB
mids 26.68 kB
increment_variance 264 bytes
start_variance 264 bytes
It does not make sense that appending the scores object (280MB) to the chrom object (3.21GB) would need +600GB of space....memory leak?
How can I help debug this memory issue? Even if it's not related to the "Error in rank.variants.chromR(chrom, scores) : index out of bounds" problem.
Although, the memory usage is problem for resource use. I guess it is because R passes by value, instead of by reference. I could not find a way to run rank.variants.chromR(chrom, scores) and replace the chrom object "in place".
Cheers
Hi JustGitting,
No worries on the response time. In fact, I'm on holiday now, so I'll probably be fairly unresponsive until next week. In the mean time, I hope you'll consider creating a minimal reproducible example. I've put an example here which includes a link to this stack overflow post.
Perhaps it would be a good exercise for you to review this thread and ask yourself 'what do you expect me to do?' In the absence of a minimal reproducible example I see this thread as 'not actionable' as there does not appear to be any way for me to act on this report. If you could produce a minimal reproducible example it would help me validate the behaviour you are reporting is reproducible on my machine (i.e., this is not specific to your machine), it would allow me to attempt to troubleshoot the matter and if/when I/we have a solution, I/we could engineer unit tests to make sure this 'fix' remains fixed. In the absence of this critical piece in this conversation, I really don't see that there is anything I can do to address your report. Please let me know if there is anything I can do to help you in constructing a minimal reproducible example. I've made several example datasets available so I feel there is adequate material to help you in this. Please let me know if I can help you modify these examples to help reproduce the behavior you're reporting.
Thanks! Brian
Hi Brian,
I understand, I'll try and create a reproducable example.
In the mean time, to answer your questions "what do you expect me to do?"
As I wrote in my initial post, I tried to debug the code, but was unsuccessful, eg using $ R --debugger=gdb I'd like to know how to add/modify/etc vcfR so that I can identify the location within vcfR that produces the message "Error in rank.variants.chromR(chrom, scores) : index out of bounds" as it could be anywhere in the code.
Even if I can reproduce the error with a small data set, we still don't know where the "index out of bounds" message is coming from.
How do you normally debug your code?
Can you recommend a good guide on how to debug R/Rcpp to catch these types of tricky errors?
I'm keen to help and learn how to debug R and Rcpp as it's a great way to learn. Thanks for your help. I'll update once I've created a reproducable example (if I can :) )
Regards
Hi JustGitting,
Sorry for the delayed response, I've had some other things to deal with. I think your question about debugging R and Rcpp is a bit beyond the scope of this repository. You can find information on this online elsewhere. Here's some examples.
http://r-pkgs.had.co.nz/src.html#src-debugging
http://kevinushey.github.io/blog/2015/04/13/debugging-with-lldb/
http://kevinushey.github.io/blog/2015/04/05/debugging-with-valgrind/
If you really think you've found a memory leak you may need to use the Address Sanitizer. This is a bit involved because you need a version of R that was compiled with this functionality. My current solution is to use rocker. I've blogged on this.
https://knausb.github.io/2017/06/running-r-devel-ubsan-clang-in-docker/
I hope that gets you pointed in the right direction. You'll still need a reproducible example in order to troubleshoot this. Let me know if you need help creating an example.
Hi Brian,
I've been able to create a reproducable example, sort of obvious now....
In short, it is because the scores are calculated from all snp's for all chromosomes. However, when the chromR object is created, it automatically selects the first chromosome and subsets the vcf file. So it only stores the SNP's for the single chromosome. Hence, when we try to add the scores, there are too many and it goes out of bounds.
The below example just extends the scores vector by one element to reproduce the error.
library(vcfR)
data(vcfR_test)
print('#################################')
print('Extract Genome Quality (GQ) and Sequence Depth (DP) data from VCF.')
print('#################################')
# Create scores to rank samples.
gq <- extract.gt(vcfR_test, element='GQ', as.numeric=TRUE)
print("Summary - GQ")
summary(gq)
dp <- extract.gt(vcfR_test, element='DP', as.numeric=TRUE)
print("Summary - DP")
summary(dp)
print('^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^')
print('#################################')
print('Filter sequence depth.')
print('#################################')
mids <- apply(dp, MARGIN=2, median, na.rm=TRUE)
dp2 <- sweep(dp, MARGIN=2, mids, FUN="-")
dp2 <- abs(dp2)
dp2 <- -1 * dp2
summary(dp2)
print('^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^')
print('#################################')
print('Normalize sequence depths.')
print('#################################')
print('Calculate abs(minimum) for each sample.')
amins <- abs(apply(dp2, MARGIN=2, min, na.rm = TRUE))
print('Add column min to each column')
dp2 <- sweep(dp2, MARGIN=2, STATS = amins, FUN="+")
print('Divided all columns by min.')
dp2 <- sweep(dp2, MARGIN=2, STATS = amins, FUN="/")
range(dp2, na.rm=TRUE)
summary(dp2)
print('^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^')
# Normalize dp2 and gq
print('#################################')
print('Normalize genotype quality depths.')
print('#################################')
gq2 <- gq/100
range(gq2, na.rm=TRUE)
summary(gq2)
print('^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^')
scores <- dp2 + gq2
scores <- rowSums(scores, na.rm = TRUE)
summary(scores)
length(scores)
chrom <- create.chromR(name="Supercontig", vcf=vcfR_test, verbose=FALSE)
#chrom <- masker(chrom, min_DP = 900, max_DP = 1500)
chrom <- proc.chromR(chrom, verbose = TRUE)
# This works as expected
chrom <- rank.variants.chromR(chrom, scores)
# Lets extend scores so it is bigger than the vcf data inside chromR object.
scores["test"] <- 2.2
length(scores)
chrom <- rank.variants.chromR(chrom, scores)
# Should get the error:
# Error in rank.variants.chromR(chrom, scores) : index out of bounds
I guess the vcf file should only contain SNP's from a single chromosome, but I feel this should be handled better.
Hope this helps.
Hi JustGitting,
That helps a lot! But we can simplify a bit more. We can start with your example.
library(vcfR)
data(vcfR_test)
chrom <- create.chromR(name="Supercontig", vcf=vcfR_test, verbose=FALSE)
chrom <- proc.chromR(chrom, verbose = TRUE)
Because this is an example, the scores are arbitrary. But we'll use a seed for reproducibility.
set.seed(9)
scores <- runif(n=nrow(vcfR_test))
chrom <- rank.variants.chromR(chrom, scores)
This should work. If we add one more score than we have variants we should reproduce your report.
set.seed(9)
scores <- runif(n=nrow(vcfR_test) + 1)
chrom <- rank.variants.chromR(chrom, scores)
But I see:
Error in rank.variants.chromR(chrom, scores) :
Index out of bounds: [index=5; extent=5].
Can you validate this?
We can add a test to the function rank.variants.chromR()
to catch this.
if( nrow(x@vcf) != length(scores) ){
msg <- "The number of variants and scores do not match."
msg <- paste(msg, " nrow(x@vcf): ", nrow(x@vcf), sep = "")
msg <- paste(msg, ", length(scores): ", length(scores), sep = "")
stop(msg)
}
which produces this message.
Error in rank.variants.chromR(chrom, scores) :
the number of variants and scores do not match. nrow(x@vcf): 5, length(scores): 6
Do you feel this handles this situation better? If not, can you recommend an improvement?
Thanks!
Hi Brian,
When I try:
set.seed(9)
scores <- runif(n=nrow(vcfR_test) + 1)
chrom <- rank.variants.chromR(chrom, scores)
I get a different error to you:
Error in rank.variants.chromR(chrom, scores) : index out of bounds
Strange??
Regarding your suggested test, that would be great!
Thank you very much!
Hi JustGitting,
I see that we have non-identical errors, but they are both 'index out of bounds' errors. I've reviewed the beginning of this thread where you state you are using R 3.3.1 whereas I am using R 3.4.1. So I'm going to interpret this non-identity as due to the different versions. They seem to 'smell' very similar.
I've made the proposed update to vcfR to handle this condition and it should appear in our next release to CRAN. Looks like it will be vcfR 1.6.0. In the mean time, if you are interested, the changes are now in the master branch at GitHub. You can install this with:
devtools::install_github(repo="knausb/vcfR", build_vignettes=TRUE)
I've also used the minimal reproducible example we made to create unit tests to ensure this fix will remain in future versions.
Thanks for your help and let me know if you encounter any other issues! Brian
Sorry for the late reply, thanks for creating a great package.
Hi,
R version: 3.3.1 vcfR version: 1.5 OS: linux
I'm following the "Ranking Data" Vignette (https://cran.r-project.org/web/packages/vcfR/vignettes/ranking_data.html) to QA a very large VCF file (~8GB file, ~3.5M calls). I'm using a sufficiently large computer running linux to handle this problem :)
Further, the VCF has four chromosomes, and the dna/gff files contain 10 chromosomes. Hence, vcfR will pick the first chromosome.
For ease of reference, below is the code from the Ranking Data Vignette example I'm using (mostly) unchanged.
The first sign of (possibly related) problems is when the minimums for each column (sample) are calculated by the code:
amins <- abs(apply(dp2, MARGIN=2, min, na.rm = TRUE))
This produces the following warning:
I'd guess this is because the sample column(s) has all NA's.... should not be a problem?
Finally, when the scores are added to the chrom object:
chrom <- rank.variants.chromR(chrom, scores)
it fails with:
Given the error is from calling a Rcpp function, I've attempted to debug this problem with the tips from http://r-pkgs.had.co.nz/src.html using the following steps:
However, the error is not caught and is simply displayed as above.
This error does not occur when I load a limited number of VCF calls. For example, when loading the vcf data by the command:
vcf <- read.vcfR(filename_vcf, verbose = TRUE, nrows = 100000)
I'm still learning R and have not touched anything with Rcpp, etc. Can you provide me with the steps to debug this so I can help you fix this error?
Do I need to clone the git repos and re-compile locally?
Thanks for your help.