Closed nhartwic closed 4 years ago
Thanks for using DupGen_finder
. It seems the gene names in your outgrou.gff is inconsistent with gene names in your outgroup.pep or targ_outgroup.blast, you should keep gene names are same in all files. If still have this problem, you can send your data via e-mail, I can check your file and test.
Qionghou Li
Do I need "ougroup.pep" in my working directory? Your readme makes no mention of it. As far as I can tell gene names are consistent here. I'm using the same methods to make your GFF format files for both my samples (target and outgroup) and am only having issues with outgroup. For reference, this is what my wkdir looks like.
head *
==> targ.blast <==
FUN_000001-T1 FUN_000001-T1 100.000 81 0 0 1 81 1 81 4.99e-55 163
FUN_000001-T1 FUN_019628-T1 55.882 34 15 0 48 81 200 233 0.001 35.0
FUN_000001-T1 FUN_014073-T1 54.839 31 14 0 41 71 131 161 0.014 32.0
FUN_000001-T1 FUN_017902-T1 37.500 56 33 1 20 73 1 56 0.25 28.5
FUN_000001-T1 FUN_017034-T1 42.105 38 19 1 27 61 341 378 0.74 26.9
FUN_000001-T1 FUN_002070-T1 36.538 52 25 2 27 78 3 46 1.0 26.6
FUN_000001-T1 FUN_025458-T1 26.829 41 30 0 37 77 6 46 1.1 26.6
FUN_000001-T1 FUN_010210-T1 35.294 34 22 0 30 63 141 174 1.3 26.6
FUN_000001-T1 FUN_022878-T1 31.707 41 27 1 14 53 46 86 6.0 24.6
FUN_000001-T1 FUN_018664-T1 57.143 14 6 0 29 42 94 107 6.1 24.6
==> targ.gff <==
1 FUN_000001-T1 17883 18126
1 FUN_000002-T1 18594 32866
1 FUN_000003-T1 35452 47763
1 FUN_000004-T1 54987 55406
1 FUN_000005-T1 56761 57102
1 FUN_000006-T1 63338 63799
1 FUN_000007-T1 64107 64859
1 FUN_000008-T1 65039 65311
1 FUN_000009-T1 65640 67157
1 FUN_000010-T1 68667 69137
==> targ_outgroup.blast <==
FUN_000001-T1 Aco008982.1 55.556 18 8 0 29 46 1985 2002 0.093 30.0
FUN_000001-T1 Aco023383.1 30.189 53 37 0 25 77 13 65 0.15 29.3
FUN_000001-T1 Aco013665.1 28.333 60 36 1 2 61 991 1043 1.1 26.9
FUN_000001-T1 Aco003809.1 28.846 52 32 1 6 57 199 245 1.1 26.9
FUN_000001-T1 Aco005999.1 55.000 20 9 0 22 41 69 88 1.3 26.9
FUN_000001-T1 Aco019060.1 41.667 24 14 0 29 52 78 101 1.4 26.6
FUN_000001-T1 Aco000047.1 31.579 57 34 2 5 61 321 372 1.9 26.2
FUN_000001-T1 Aco000473.1 29.167 48 28 1 35 76 186 233 2.3 26.2
FUN_000001-T1 Aco005928.1 45.455 33 12 2 24 51 17 48 2.5 26.2
FUN_000001-T1 Aco018338.1 42.857 28 16 0 24 51 41 68 4.1 25.4
==> targ_outgroup.gff <==
LG01 Aco009737.1 4423 7510
LG01 Aco009736.1 30565 33237
LG01 Aco009735.1 31068 33597
LG01 Aco009734.1 38233 41659
LG01 Aco009733.1 44128 44718
LG01 Aco009732.1 45723 48320
LG01 Aco009731.1 48135 51491
LG01 Aco009730.1 52603 60674
LG01 Aco009729.1 61619 65001
LG01 Aco009728.1 64934 80064
(base) nol@LAPTOP-3MJBSED2:~/BIO/spcyc/dup_gen_dir/wkdir$ grep Aco003809.1 targ_outgroup.gff
LG15 Aco003809.1 3906493 3912332
You can find gzipped versions of the blast and gff files that I've tried at the google drive link below...
https://drive.google.com/drive/folders/1_U2DCIATBwpv_01Cnd2FthS5sKlUq1ob?usp=sharing
....I'm using soft links to match the file patterns your tools requires. My work dir looks like the following. I've tried using both of the outgroup blasts which just vary which sample is used as query and which is used as a database. Both produced the same results....
(base) nol@LAPTOP-3MJBSED2:~/BIO/spcyc/dup_gen_dir/wkdir$ ls -lha
total 1.8M
drwxrwxrwx 1 nol nol 4.0K Sep 8 21:30 .
drwxrwxrwx 1 nol nol 4.0K Sep 8 21:31 ..
lrwxrwxrwx 1 nol nol 27 Sep 7 14:26 targ.blast -> ../sp9504.sp9504.blastp.tsv
-rwxrwxrwx 1 nol nol 849K Sep 7 14:54 targ.gff
lrwxrwxrwx 1 nol nol 24 Sep 7 19:33 targ_outgroup.blast -> ../sp9504.aco.blastp.tsv
-rwxrwxrwx 1 nol nol 891K Sep 7 16:49 targ_outgroup.gff
No, you just need files which readme described.
I have checked your files and found the file named targ_outgroup.gff.gz
only contained genes in outgroup.gff
but genes in targ.gff were missing in targ_outgroup.gff.gz
.
As we described in readme:
[target_species]_[outgroup_species].gff, a gene position file for the target_species and outgroup_species, following a tab-delimited format.
That is, the targ_outgroup.gff should contain both genes in target and outgroup species. But in your file, I can only find genes in outgroup.
I have tested your data. When I cat
your targ.gff
and outgroup.gff
to targ_outgroup.gff
, I can get normal results.
Files:
targ.blast targ.gff targ_outgroup.blast targ_outgroup.gff
This is my stdout:
Reading BLAST file and pre-processing
Generating BLAST list
708823 matches imported (1542503 discarded)
2307 pairwise comparisons
7316 alignments generated
Pairwise collinear blocks written to ./targ.collinearity [22.718 seconds elapsed]
Reading BLAST file and pre-processing
Generating BLAST list
1289163 matches imported (1247861 discarded)
19434 pairwise comparisons
Pairwise collinear blocks written to outdir/targ_outgroup.collinearity
Fair enough. I'm not sure why input needs to be duplicated but thanks for your assistance.
stdout from execution below...
Seems like my outgroup blast file isn't working for whatever reason. Is there a some secondary filtering on the alignments happening? I've tried swapping which sample I use as query and which I use as database and it doesn't seem to matter. All the alignments get discarded either way. Any thoughts on what I could check? Is this expected output somehow?