Closed rcorty closed 7 years ago
I think the main idea here is good -- if the user specifies nrow, he doesn't need an accurate response from 'vcfR_vcf_stats_gz', so we may as well not read all the lines in the file just to get an accurate 'stats'.
Though the idea is good, the implementation is kludgey.
Hi Robert, thanks for your interested in vcfR! However, you should be able to see that your PR failed tests, so I can't accept it in its current form. Perhaps we can discuss this to find a proper way forward.
After looking at your code I think I have an idea of what you're attempting. But let me explain it to you to see if we're on the same page. A call to read.vcfR()
first calls vcf_stats_gz
(not exported to the user) to scan the VCF file and collect some information on it. One of the things it does is collect the number of rows and columns in the matrix-like section of the file. This information is used to initialize an Rcpp::CharacterMatrix
to hold this information. This avoids the alternative of growing a data structure as lines are read in, which typically has a performance consequence.
I think the issue you have with this behaviour is that if the user only desires to input the meta region and header, or just variants 1:10, or even variants 10:20, vcf_stats_gz
still scans through the entire file. In large files this has a performance cost and if the user is only interested in the head of the file it is an unnecessary cost. Does this sound like your concern?
I am concerned by your choice to create new functions because they appears to have code that is redundant to existing vcfR
code. Do you agree? It seems to me that it may be simpler to add parameters nrows
and skip
to vcf_stats_gz
as an attempt make use of existing code but extending its functionality to provide the behaviour I think you're suggesting. Does this sound like a reasonable path forward? This may create side effects that I'm not considering. However, I want to believe I have these functions fairly well covered with unit tests, so hopefully any side effects would become obvious.
Thanks! Brian
Brian,
Yes, you're understanding my goals exactly.
I agree that your proposed solution is much better. I'll take a stab at it.
Robert
@rcorty this thread seems to have gone stale. Do you still consider yourself as working on this? If not, I think we should close this. If yes, I think we should close this and move it to an issue. I also suggest creating a branch for your work if its not too late for that. With this as a pull request I get tracking for your commits where what I would really like to see is the culmination of your commits as a fork. Thanks!
@knausb I am no longer working on this. Please feel free to do anything you want with my proposed contribution (trash it, incorporate it "as is", or turn it into a branch). If you need anything from me to do any of those things, please let me know.
Ok. I'll close this and have created issue #70. I like your idea and appreciate your bringing the matter to my attention. By creating an issue I hope I'll get to include your suggestion in a future release. Thank you!
Created two Rcpp functions 'stat_line_no_variants' and 'vcfR_vcf_stats_no_variants_gz'.
Modified read.vcfR so that when the argument nrow=0, it uses 'vcfR_vcf_stats_no_variants_gz' to get the stats of the vcf file, rather than the standard version. When there are many variants, and the user only wants the meta and the participant IDs, this can save a lot of time.