lydiayliu opened 1 year ago
`parseVEP` does read and process records line by line, but the issue is that they are not written until the very end. The only reason for this is that variants are sorted before being written to disk. However, the sorting is done at the transcript level, and the input VCF to VEP is always sorted, which means records mapped to the same gene are always next to each other. So we could actually write records to the GVF as soon as the current transcript ID or gene ID changes.
Update: it turns out this is not the case, particularly since the same position can be mapped to multiple genes. Maybe we can sort the VEP file by gene ID before parsing.
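If the file is sorted by gene ID first, the streaming write becomes straightforward. Here is a minimal sketch of the flush-on-gene-change idea, assuming a gene-ID-ordered record iterator; the `VepRecord` fields and function names are hypothetical placeholders, not moPepGen's actual data model:

```python
from itertools import groupby
from typing import Iterator, NamedTuple, TextIO

class VepRecord(NamedTuple):
    # Hypothetical stand-in for a parsed VEP record; the field names
    # are illustrative only.
    gene_id: str
    location: str
    line: str

def write_gvf_streaming(records: Iterator[VepRecord], outfile: TextIO) -> None:
    """Flush each gene's records as soon as the gene ID changes,
    instead of buffering every record until the end of the file."""
    for _gene_id, group in groupby(records, key=lambda r: r.gene_id):
        # Once groupby advances past a gene, all of its records have been
        # seen, so they can be sorted locally and written out immediately.
        for record in sorted(group, key=lambda r: r.location):
            outfile.write(record.line)
```

This keeps at most one gene's worth of records in memory at a time, regardless of total file size.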
Ahh got it. How bad would it be for `parseVEP` to write a GVF that is not sorted?
There doesn't seem to be an out-of-the-box solution in Python for sorting a very large file on disk. A possible approach is splitting the file into chunks, sorting each chunk, and then merging the sorted chunks. The easiest solution still seems to be GNU sort. We could implement a sorted VEP parser that handles the case where the VEP file is already sorted, and add a flag such as `--sorted` to enable this behavior; if it's not specified, we fall back to the original parser that writes all results at the end.
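For reference, a minimal sketch of the chunk-and-merge approach using only the standard library (`heapq.merge` does the lazy k-way merge). It sorts whole lines lexicographically for simplicity; a real implementation would key on the gene ID column, and the function name and chunk size here are just illustrative:

```python
import heapq
import itertools
import tempfile
from pathlib import Path

def external_sort(in_path: str, out_path: str, chunk_lines: int = 1_000_000) -> None:
    """Sort a large text file on disk: sort fixed-size chunks in memory,
    write each to a temporary file, then k-way merge the sorted chunks."""
    tmp_paths = []
    with open(in_path) as infile:
        while True:
            chunk = list(itertools.islice(infile, chunk_lines))
            if not chunk:
                break
            chunk.sort()  # whole-line lexicographic sort for simplicity
            with tempfile.NamedTemporaryFile("w", delete=False, suffix=".chunk") as tmp:
                tmp.writelines(chunk)
                tmp_paths.append(tmp.name)
    # heapq.merge streams the already-sorted chunk files line by line, so
    # only one line per chunk is held in memory during the merge.
    chunk_files = [open(p) for p in tmp_paths]
    try:
        with open(out_path, "w") as outfile:
            outfile.writelines(heapq.merge(*chunk_files))
    finally:
        for handle in chunk_files:
            handle.close()
        for path in tmp_paths:
            Path(path).unlink()
```

The GNU sort equivalent would be a one-liner along the lines of `sort -t $'\t' -k4,4 input.tsv`, with the key column adjusted to wherever the gene ID sits in the VEP output.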
I tried to `parseVEP` an 800 MB TSV from all of mouse dbSNP, here: `/hot/project/method/AlgorithmDevelopment/ALGO-000074-moPepGen/Giansanti_Mouse/processed_data/mpg_snp_indel/VEP/dbSNP150_GCF_000001635.24-All.tsv.gz`. `parseVEP` was killed, and I assume that is because it tried to read in the entire file as the first step. This is probably unnecessary for parsing and could be switched to reading the file line by line?
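Switching to streaming would keep memory use flat. A minimal sketch of what the line-by-line read could look like, handling the gzipped input; the function name is hypothetical and not part of the current `parseVEP` internals:

```python
import gzip
from typing import Iterator, List

def iter_vep_rows(path: str) -> Iterator[List[str]]:
    """Yield one parsed TSV row at a time, so memory use stays flat no
    matter how large the (possibly gzipped) VEP output is."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):  # skip VEP header/comment lines
                continue
            yield line.rstrip("\n").split("\t")
```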