jydu / maffilter

The MafFilter genome alignment processor
GNU General Public License v3.0

Fasta output error #31

Open jacobscott7071 opened 1 week ago

jacobscott7071 commented 1 week ago

I've been running the 241-mammal genome alignment from Zoonomia (https://cglgenomics.ucsc.edu/november-2020-nature-mammalian-and-avian-alignments/) through several filters. When I get to the point of writing the result out as FASTA files, I've used two versions of the parameter file below: one with coordinates=no and one with coordinates=yes.

input.file=/blue/cohn/ja.scott/filtered_maf/LEGSoR_allspecies_mergeall_5.5.maf.gz
input.file.compression=gzip
input.format=Maf

output.log=/blue/cohn/ja.scott/241mammals/logs/LEGSoR_allspecies_mergeall_5.5_mergeMM10_cleanup_output.log

maf.filter= \
    Merge( \
        species=(Mus_musculus), \
        dist_max=0, \
        ignore_chr=none, \
        rename_chimeric_chromosomes=yes), \
    MinBlockLength(min_length=30), \
    OutputAlignments( \
        file=/blue/cohn/ja.scott/LEG_5.5_mergeMM10.EC/elements_test/LEG_5.5mergeMM10%i.fasta, \
        compression=none, \
        format=Fasta, \
        mask=no, \
        coordinates=no) \

When I run with coordinates=no, MafFilter outputs 9767 files. When I run with coordinates=yes, MafFilter outputs 78243 files. I've rerun the filter multiple times, and I consistently get the same number of files for each version. The filtered MAF file itself has 78229 'a' lines, which is close to the coordinates=yes file count but not an exact match. Excluding the sequence headers, the contents of the first 9767 files from coordinates=yes match their coordinates=no equivalents. To me, this suggests that coordinates=no is omitting blocks from the output. Do you know why this would be?
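For anyone wanting to repeat the two checks described above, here is a minimal Python sketch, not part of MafFilter. It assumes the usual MAF convention that each alignment block starts with a line beginning with 'a', and that the FASTA files are plain text. The function names (count_maf_blocks, fasta_sequences, same_sequences) are hypothetical helpers made up for illustration.

```python
import gzip


def count_maf_blocks(maf_path):
    """Count alignment blocks, i.e. lines that open a block with 'a'."""
    # Assumption: a ".gz" suffix means the file is gzip-compressed.
    opener = gzip.open if str(maf_path).endswith(".gz") else open
    n_blocks = 0
    with opener(maf_path, "rt") as handle:
        for line in handle:
            # A block header is 'a' alone or 'a' followed by whitespace.
            if line.startswith("a") and (len(line) == 1 or line[1].isspace()):
                n_blocks += 1
    return n_blocks


def fasta_sequences(fasta_path):
    """Return the sequences of a FASTA file in order, with headers dropped."""
    sequences, current = [], None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new record starts: store the previous one, if any.
                if current is not None:
                    sequences.append("".join(current))
                current = []
            elif line and current is not None:
                current.append(line)
    if current is not None:
        sequences.append("".join(current))
    return sequences


def same_sequences(fasta_a, fasta_b):
    """True if both files hold the same sequences, ignoring header text."""
    return fasta_sequences(fasta_a) == fasta_sequences(fasta_b)
```

Comparing count_maf_blocks on the filtered MAF with the number of FASTA files in each output directory, and running same_sequences on matching file pairs, would reproduce the figures quoted above without depending on the header format.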

jydu commented 5 days ago

Hi,

This is most strange. coordinates=yes/no should only affect the sequence names in the output alignment. It does not do any filtering. Would it be possible to have access to your input maf (even a shortened version) so that I can investigate what is going on?

All the best,

Julien.

jydu commented 4 days ago

UPDATE: I have tried with a file of my own and I cannot reproduce the issue. I would therefore need the original data to understand what is going on.
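One possible way to produce a shortened version of a large MAF file for sharing is sketched below in Python. This is an illustration only, not a MafFilter feature. The helper names (open_maf, write_first_blocks) and the default of 1000 blocks are assumptions, and a truncated file may or may not still reproduce the issue.

```python
import gzip


def open_maf(path, mode):
    """Open a MAF file as text, using gzip when the name ends in '.gz'."""
    opener = gzip.open if str(path).endswith(".gz") else open
    return opener(path, mode)


def write_first_blocks(src_path, dst_path, max_blocks=1000):
    """Copy the header lines and the first max_blocks alignment blocks."""
    n_blocks = 0
    with open_maf(src_path, "rt") as src, open_maf(dst_path, "wt") as dst:
        for line in src:
            # Blocks open with an 'a' line ('a' alone or 'a' plus whitespace).
            starts_block = line.startswith("a") and (
                len(line) == 1 or line[1].isspace()
            )
            if starts_block:
                if n_blocks == max_blocks:
                    break  # stop before writing block number max_blocks + 1
                n_blocks += 1
            dst.write(line)
    return n_blocks
```

For example, write_first_blocks("input.maf.gz", "subset.maf.gz", max_blocks=500) would write a much smaller gzip-compressed file; both file names here are placeholders.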

jacobscott7071 commented 1 day ago

Hi Julien -- Thanks for the quick reply, and sorry for my late one. Sounds like it's likely something on my end. I'd be happy to send you the original data in case you have some insight into what is going wrong. Is there a convenient way to send it to you without uploading it to this comment?

-- Jacob

jydu commented 1 day ago

Dear Jacob,

It is also difficult to handle all possible variations of the maf format :) How big is the file?

Best,

Julien.

jacobscott7071 commented 1 day ago

The particular file in question is 1.7 GB. That said, it is ultimately derived from a 1.1 TB file, and I suspect that some of the problems may have originated upstream during the filtering of this original file.

-- Jacob