masking: The current position of the input Variant is greater than the length of the chromosome as specified by the SequenceDict

zajacn commented 3 weeks ago

Hi Lucas,

Many thanks for this awesome tool and fantastic documentation. I have one issue regarding the behaviour of --filter-mask-bed parameter. I will use a test example to show what I mean.

I have a sync file of 10 SNPs on the same chromosome at positions 3364 - 3498 for 16 pools. The sync file has no missing data (all pools have a coverage of a minimum of 25 for each position) and no header. It looks like this (I put the first line):

agouti_scaf_1001 3364 T 0:32:0:0:0:0 0:46:0:0:0:0 0:44:0:0:0:0 0:23:0:0:0:0 0:48:0:0:0:0 0:51:0:0:0:0 0:58:0:8:0:0 0:46:0:8:0:0 0:32:0:0:0:0 0:75:0:0:0:0 0:64:0:6:0:0 0:62:0:0:0:0 0:40:0:0:0:0 0:69:0:0:0:0 0:47:0:0:0:0 0:47:0:0:0:0

The chromosome on which these SNPs are is 10,000bp long. I have a fasta file of that chromosome and my genome.fasta.fai looks like this:

agouti_scaf_1001 10000 17 80 81

I want to calculate Fst (Unbiased Hudson) over intervals of 2kb. But I also have a bed file for masked positions that looks as follows:

agouti_scaf_1001    3380    3390
agouti_scaf_1001    1200    1700
agouti_scaf_1001    4000    5000

The commands I tested were:

grenedalf fst --sync-path test.sync --window-type interval --window-average-policy window-length --method unbiased-hudson --write-pi-tables --pool-sizes 24 --reference-genome-fasta-file genome.fasta --window-interval-width 2000 --window-interval-stride 0 --out-dir . --file-prefix test.2kb --filter-mask-bed mask.txt

grenedalf fst --sync-path test.sync --window-type interval --window-average-policy window-length --method unbiased-hudson --write-pi-tables --pool-sizes 24 --reference-genome-fai-file genome.test.fai --window-interval-width 2000 --window-interval-stride 0 --out-dir . --file-prefix test.2kb --filter-mask-bed mask.txt

Both of the commands produce satisfactory results and run successfully but the message says:

Processed 4 windows. Processed 1 chromosome with 6781 positions.

Now when I modify the bed file used in --filter-mask-bed to contain any position above the 6781, e.g. as such:

agouti_scaf_1001        3380    3390
agouti_scaf_1001        1200    1700
agouti_scaf_1001        4000    5000
agouti_scaf_1001        6782    6790

I get an error message:

At chromosome agouti_scaf_1001
Error: The current position agouti_scaf_1001:10001 of the input Variant is greater than the length of the chromosome as specified by the SequenceDict, which is 10000

Why are only 6781 positions included if my chromosome is of length 10,000? Is there something I am doing wrong? I have created a test example because I get this error when running it on my real data. I would appreciate if you could help me understand the issue. Let me know if you need more information.

Best regards, Natalia

lczech commented 2 weeks ago

Hi Natalia @zajacn,

thanks for the great description of the issue and for testing it so thoroughly!

Hm, so if you already have a test case that reproduces the issue, could you please share that with me? That would make it way easier to track down what is going on here. My gut feeling is that some of the checks for file lengths etc might accidentally be done in the wrong order under your specific conditions. I'd like to debug this with your test data - let's see.

Cheers and so long Lucas

zajacn commented 1 week ago

Hi Lucas,

Thanks so much for your reply. I attach the test files I used for testing. test_files.zip

Best wishes Natalia

lczech / grenedalf

masking: The current position of the input Variant is greater than the length of the chromosome as specified by the SequenceDict #29