Open jacobscott7071 opened 2 months ago
Hi,
This is most strange. coordinates=yes/no should only affect the sequence names in the output alignment. It does not do any filtering. Would it be possible to have access to your input maf (even a shortened version) so that I can investigate what is going on?
All the best,
Julien.
UPDATE: I have tried with a file of my own and I cannot reproduce the issue. I would therefore need the original data to understand what is going on.
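A shortened version of the input could be produced with something like the Python sketch below. It assumes the standard MAF layout, where each alignment block starts with an 'a' line and header lines precede the first block. The function name and arguments are illustrative and not part of MafFilter.

```python
import gzip


def head_maf_blocks(src_path, dst_path, n_blocks):
    """Copy the header and the first n_blocks blocks of a gzipped MAF file.

    Returns the number of blocks actually written.
    """
    seen = 0
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        for line in src:
            # An 'a' line opens a new alignment block.
            if line.split(None, 1)[:1] == ["a"]:
                seen += 1
                if seen > n_blocks:
                    break
            dst.write(line)
    return min(seen, n_blocks)
```

Because the file is read as a stream, this should also work on very large inputs without loading them into memory.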
Hi Julien -- Thanks for the quick reply and sorry for my late reply. Sounds like it's likely something on my end. I'd be happy to send you the original data, if you may have some insight as to what is going wrong. Is there a convenient way to send it to you without uploading it to this comment?
-- Jacob
Dear Jacob,
It is also difficult to handle all possible variations of the maf format :) How big is the file?
Best,
Julien.
The particular file in question is 1.7 GB. That said, it is ultimately derived from a 1.1 TB file, and I suspect that some of the problems may have originated upstream during the filtering of this original file.
-- Jacob
Dear Jacob,
Ok. Could you contact me at my email address (which can be found here: https://www.evolbio.mpg.de/2996566/group_molsysevolution)? I will then reply with a Dropbox link where you can upload the file, and I will have a look at it!
Best,
Julien.
I've been running the 241 mammal genome alignment from Zoonomia (https://cglgenomics.ucsc.edu/november-2020-nature-mammalian-and-avian-alignments/) through a few different filters now. When I get to the point of outputting my file to fastas, I've used two versions of the parameter file below, one where coordinates=no and one where coordinates=yes.
input.file=/blue/cohn/ja.scott/filtered_maf/LEGSoR_allspecies_mergeall_5.5.maf.gz
input.file.compression=gzip
input.format=Maf
output.log=/blue/cohn/ja.scott/241mammals/logs/LEGSoR_allspecies_mergeall_5.5_mergeMM10_cleanup_output.log
maf.filter= \
    Merge( \
        species=(Mus_musculus), \
        dist_max=0, \
        ignore_chr=none, \
        rename_chimeric_chromosomes=yes), \
    MinBlockLength(min_length=30), \
    OutputAlignments( \
        file=/blue/cohn/ja.scott/LEG_5.5_mergeMM10.EC/elements_test/LEG_5.5mergeMM10%i.fasta, \
        compression=none, \
        format=Fasta, \
        mask=no, \
        coordinates=no) \
When I run coordinates=no, MafFilter outputs 9767 files. When I run coordinates=yes, MafFilter outputs 78243 files. I've rerun the filter multiple times, and I consistently get the same number of files for each version. The filtered maf file itself has 78229 'a' lines, which resembles the coordinates=yes file number, though is not an exact match. Excluding the sequence headers, the contents of the first 9767 files of coordinates=yes match their equivalents in coordinates=no. To me, this suggests that coordinates=no is omitting blocks from the output. Do you know why this would be?
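For reference, the two checks described above (counting 'a' lines in the filtered maf, and comparing output files while ignoring sequence headers) can be sketched roughly as follows in Python. This is a minimal sketch, not the exact commands I ran: the function names are illustrative and not part of MafFilter, and it assumes a gzipped MAF input and uncompressed FASTA outputs.

```python
import gzip


def count_maf_blocks(path):
    """Count alignment blocks in a gzipped MAF file by counting 'a' lines."""
    n = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.split(None, 1)[:1] == ["a"]:
                n += 1
    return n


def fasta_sequences(path):
    """Return the sequences of a FASTA file in order, without their headers."""
    records = []
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # New record: the header itself is deliberately discarded.
                current = []
                records.append(current)
            elif line and current is not None:
                current.append(line)
    return ["".join(parts) for parts in records]


def same_sequences(path_a, path_b):
    """True if both FASTA files hold the same sequences in the same order."""
    return fasta_sequences(path_a) == fasta_sequences(path_b)
```

Calling same_sequences on a coordinates=no file and its coordinates=yes counterpart compares only the aligned sequences, so differences in sequence names do not affect the result.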