knausb / vcfR

Tools to work with variant call format files
248 stars 54 forks source link

Modified read.vcfR so that when nrow=0, it's much faster #69

Closed rcorty closed 7 years ago

rcorty commented 7 years ago
  1. Created two Rcpp functions 'stat_line_no_variants' and 'vcfR_vcf_stats_no_variants_gz'.

  2. Modified read.vcfR so that when the argument nrow=0, it uses 'vcfR_vcf_stats_no_variants_gz' to get the stats of the vcf file, rather than the standard version. When there are many variants, and the user only wants the meta and the participant IDs, this can save a lot of time.

rcorty commented 7 years ago

I think the main idea here is good -- if the user specifies nrow, he doesn't need an accurate response from 'vcfR_vcf_stats_gz', so we may as well not read all the lines in the file just to get an accurate 'stats'.

Though the idea is good, the implementation is kludgey.

coveralls commented 7 years ago

Coverage Status

Coverage decreased (-0.02%) to 42.163% when pulling daab03fac02d7f8e279ec7ded6edafc7699476cc on rcorty:master into 6e243073661c4d79d9a77822282200cc0a045b4e on knausb:devel.

knausb commented 7 years ago

Hi Robert, thanks for your interested in vcfR! However, you should be able to see that your PR failed tests, so I can't accept it in its current form. Perhaps we can discuss this to find a proper way forward.

After looking at your code I think I have an idea of what you're attempting. But let me explain it to you to see if we're on the same page. A call to read.vcfR() first calls vcf_stats_gz (not exported to the user) to scan the VCF file and collect some information on it. One of the things it does is collect the number of rows and columns in the matrix-like section of the file. This information is used to initialize an Rcpp::CharacterMatrix to hold this information. This avoids the alternative of growing a data structure as lines are read in, which typically has a performance consequence.

I think the issue you have with this behaviour is that if the user only desires to input the meta region and header, or just variants 1:10, or even variants 10:20, vcf_stats_gz still scans through the entire file. In large files this has a performance cost and if the user is only interested in the head of the file it is an unnecessary cost. Does this sound like your concern?

I am concerned by your choice to create new functions because they appears to have code that is redundant to existing vcfR code. Do you agree? It seems to me that it may be simpler to add parameters nrows and skip to vcf_stats_gz as an attempt make use of existing code but extending its functionality to provide the behaviour I think you're suggesting. Does this sound like a reasonable path forward? This may create side effects that I'm not considering. However, I want to believe I have these functions fairly well covered with unit tests, so hopefully any side effects would become obvious.

Thanks! Brian

rcorty commented 7 years ago

Brian,

Yes, you're understanding my goals exactly.

I agree that your proposed solution is much better. I'll take a stab at it.

Robert

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.08%) to 42.259% when pulling 694e6bc00a3c189a7cfd5c7f18648db944740d52 on rcorty:master into 6e243073661c4d79d9a77822282200cc0a045b4e on knausb:devel.

knausb commented 7 years ago

@rcorty this thread seems to have gone stale. Do you still consider yourself as working on this? If not, I think we should close this. If yes, I think we should close this and move it to an issue. I also suggest creating a branch for your work if its not too late for that. With this as a pull request I get tracking for your commits where what I would really like to see is the culmination of your commits as a fork. Thanks!

rcorty commented 7 years ago

@knausb I am no longer working on this. Please feel free to do anything you want with my proposed contribution (trash it, incorporate it "as is", or turn it into a branch). If you need anything from me to do any of those things, please let me know.

knausb commented 7 years ago

Ok. I'll close this and have created issue #70. I like your idea and appreciate your bringing the matter to my attention. By creating an issue I hope I'll get to include your suggestion in a future release. Thank you!