harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License
69 stars 32 forks source link

Error "incorrect cell order" in gatk_db_import step #121

Closed reedjohnkenny closed 10 months ago

reedjohnkenny commented 1 year ago

Hi,

I am getting a recurring error when I run the gatk_db_import rule. db intervals 0-140 are created correctly, but all intervals above 140 error out. The full error is "Incorrect cell order found - cells must be in column major order. Previous cell: [ 28, 1004820318 ] current cell: [ 28, 1004820318 ]. The most likely cause is unexpected data in the input file: (a) A VCF file has two lines with the same genomic position (b) An unsorted CSV file (c) Malformed VCF file (or malformed index) See point 2 at: https://github.com/Intel-HLS/GenomicsDB/wiki/Importing-VCF-data-into-GenomicsDB#organizing-your-data"

The cell numbers are different every time, but other than that it is the same error.

I have tried deleting the gvcf's and rerunning from there, as well as deleting the problematic db_intervals as well as the db_mapfile and rerunning from there. I also tried using bcftools to normalize the vcf's as recommended in the github page linked in the error, but got an improper gzip header error when I ran the pipeline with normalized files. The output from normalizing the files also did not show that it resolved any duplicated positions, so it seems like that is not the problem.

Attached is a log file from gatk_db_import showing the error.

Thanks for your help! 0220.txt

cademirch commented 1 year ago

Hey @reedjohnkenny, sorry I couldn't respond sooner. Unfortunately I haven't encountered this issue before. Seems like there could be duplicate variants in some gVCFs. Can you check this? Also if you can could you share your config.yaml?

@tsackton or @erikenbody have either of you seen this before?

tsackton commented 1 year ago

Hi @reedjohnkenny. This is a strange error, I have not seen this before. It would be useful to see your config.yaml file to help us debug, if you can share that.

The fact that when you use bcftools norm you get an improper gzip header error makes me think that perhaps there is some file corruption somewhere. Are you using a shared, networked file system for this run? I have sometimes seem corrupted file operations from network or disk latency using a SLURM cluster with a large networked scratch drive.

However, it is not obvious why this would affect the later intervals only.

When you deleted vcfs and reran, did you delete the merged gVCF for each sample, the interval gVCFs, or both?

reedjohnkenny commented 1 year ago

Thanks so much for looking into this.

I have attached the config.yaml.

I am working on a shared networked file system with a shared scratch drive on a SLURM cluster.

I only deleted the merged gVCF's, not the interval gVCF's. I had though about deleting the interval gVCF's that were messed up but got confused because the numbering on the the interval gVCF's is different than the numbering on the database gVCF's and thought I would check in with ya'll before I deleted them all and restarted that step.

I think bcftools norm showed that there were not duplicate variants when I ran it, but I will check that again.

Thanks again config.yaml.txt

tsackton commented 1 year ago

Hi @reedjohnkenny,

Can you try validating the interval gVCF files? Before further debugging it would be useful to rule out possible file corruption here.

The easiest way to do this is with the GATK ValidateVariants tool, something like gatk ValidateVariants -V file.g.vcf.gz --validate-GVCF. You can look here for more documentation: https://gatk.broadinstitute.org/hc/en-us/articles/360037057272-ValidateVariants

If this reveals a problem with one of the interval gVCFs, then I would suggest deleting the problematic intervals and rerunning. If not, we'll have to further debug.

We use different interval sizes for the gVCF step and the DB import step, so the intervals not matching is expected.

cademirch commented 1 year ago

@reedjohnkenny Were you able to resolve this?

reedjohnkenny commented 12 months ago

Hi Cade,

Sorry, I have had to switch back to lab and field work so I have not pursued this further. The CCGP bioinformatics team will run the pipeline on our full dataset soon so I don't think it's worthwhile for me to push forward with analyzing preliminary data on our cluster.

Thanks, Reed

On Wed, Sep 27, 2023 at 1:43 PM Cade Mirchandani @.***> wrote:

@reedjohnkenny https://github.com/reedjohnkenny Were you able to resolve this?

— Reply to this email directly, view it on GitHub https://github.com/harvardinformatics/snpArcher/issues/121#issuecomment-1738051588, or unsubscribe https://github.com/notifications/unsubscribe-auth/ARFUELWHSJO73A7QBSGL7HTX4SFYPANCNFSM6AAAAAA3XSBT4A . You are receiving this because you were mentioned.Message ID: @.***>