uclahs-cds / package-moPepGen

Multi-Omics Peptide Generator
https://uclahs-cds.github.io/package-moPepGen/
GNU General Public License v2.0

`parseVEP` crashes when reading extremely large files #683

Open lydiayliu opened 1 year ago

lydiayliu commented 1 year ago

I tried to run parseVEP on an 800 MB TSV of all of mouse dbSNP, here: /hot/project/method/AlgorithmDevelopment/ALGO-000074-moPepGen/Giansanti_Mouse/processed_data/mpg_snp_indel/VEP/dbSNP150_GCF_000001635.24-All.tsv.gz

parseVEP was Killed (presumably out of memory), and I assume it's because it tried to read the entire file into memory as a first step. This is probably unnecessary for parsing; could it be switched to reading the file line by line? A sketch of what I mean is below.
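For illustration, a minimal sketch of that kind of streaming read, assuming the input is a gzipped TSV; `iter_vep_records` and the header handling are made up for this example, not moPepGen's actual code:

```python
import gzip

# Hypothetical streaming reader: the function name and the decision to
# skip '#' header lines are assumptions, not moPepGen's implementation.
def iter_vep_records(path):
    """Yield one VEP record at a time as a list of tab-separated fields."""
    with gzip.open(path, 'rt') as handle:
        for line in handle:
            if line.startswith('#'):
                continue
            yield line.rstrip('\n').split('\t')

# Memory use stays flat regardless of file size:
for record in iter_vep_records('dbSNP150_GCF_000001635.24-All.tsv.gz'):
    pass  # process `record` here
```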

zhuchcn commented 8 months ago

parseVEP does read and process records line by line; the issue is that they are not written until the very end. The only reason for this is that variants are sorted before being written to disk. But the sorting is done at the transcript level, and the input VCF to VEP is always sorted, which means records mapped to the same gene are always next to each other. So we could actually write records to the GVF as soon as the current transcript ID or gene ID changes.

Update: it turns out this is not the case, particularly because the same position can map to multiple genes. Maybe we can sort the VEP file by gene ID before parsing.
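If the file were pre-sorted by gene ID (e.g. with GNU sort), the parser could flush one gene at a time. A rough sketch of that idea; the column indices and the GVF serialization below are illustrative assumptions, not moPepGen's actual API:

```python
from itertools import groupby

GENE_ID_FIELD = 3   # assumed column index of the gene ID
POSITION_FIELD = 1  # assumed column index of the variant position

def parse_sorted_vep(records, out_handle):
    """records: split TSV rows, already sorted by gene ID."""
    for gene_id, group in groupby(records, key=lambda rec: rec[GENE_ID_FIELD]):
        # Only one gene's variants are held in memory at a time; they can
        # be position-sorted and flushed before moving to the next gene.
        variants = sorted(group, key=lambda rec: int(rec[POSITION_FIELD]))
        for rec in variants:
            out_handle.write('\t'.join(rec) + '\n')
```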

lydiayliu commented 8 months ago

Ahh got it. How bad is it for parseVEP to write a GVF that is not sorted?

zhuchcn commented 8 months ago

There doesn't seem to be an out-of-the-box way in Python to sort a very large file on disk. One option is to split the file into chunks, sort each chunk, and then merge them. The easiest solution still seems to be GNU sort. We could also implement a sorted VEP parser that handles the VEP file when it's already sorted, gated behind a flag such as --sorted; if the flag is not specified, fall back to the original parser that writes all results at the end.
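For what it's worth, the chunk-sort-merge idea can be sketched in pure Python with `heapq.merge`, which lazily k-way merges already-sorted iterables. The chunk size and the shape of the sort key here are illustrative:

```python
import heapq
import tempfile

def _dump_sorted_chunk(lines, key):
    """Write one sorted chunk to a temporary file and rewind it for merging."""
    tmp = tempfile.TemporaryFile(mode='w+t')
    tmp.writelines(sorted(lines, key=key))
    tmp.seek(0)
    return tmp

def external_sort(lines, key, chunk_size=1_000_000):
    """Sort an iterable of newline-terminated lines that won't fit in memory."""
    chunks, buffer = [], []
    for line in lines:
        buffer.append(line)
        if len(buffer) >= chunk_size:
            chunks.append(_dump_sorted_chunk(buffer, key))
            buffer = []
    if buffer:
        chunks.append(_dump_sorted_chunk(buffer, key))
    # heapq.merge streams the merged output without loading all chunks at once.
    yield from heapq.merge(*chunks, key=key)
```

That said, piping the decompressed TSV through GNU sort keyed on the gene ID column before parsing would probably be simpler and faster than doing this in Python.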