qiao-xin / DupGen_finder

A pipeline used to identify different modes of duplicated gene pairs
92 stars 29 forks source link

Outgroups seem unused #8

Closed nhartwic closed 4 years ago

nhartwic commented 4 years ago

stdout from execution below...

Reading BLAST file and pre-processing
Generating BLAST list
708823 matches imported (1542503 discarded)
2307 pairwise comparisons
7316 alignments generated
Pairwise collinear blocks written to outdir/targ.collinearity [41.574 seconds elapsed]
Reading BLAST file and pre-processing
Generating BLAST list
0 matches imported (2563444 discarded)
0 pairwise comparisons
0 alignments generated
Pairwise collinear blocks written to outdir/targ_outgroup.collinearity [8.996 seconds elapsed]

Seems like my outgroup blast file isn't working for whatever reason. Is there a some secondary filtering on the alignments happening? I've tried swapping which sample I use as query and which I use as database and it doesn't seem to matter. All the alignments get discarded either way. Any thoughts on what I could check? Is this expected output somehow?

LQHHHHH commented 4 years ago

Thanks for using DupGen_finder. It seems the gene names in your outgrou.gff is inconsistent with gene names in your outgroup.pep or targ_outgroup.blast, you should keep gene names are same in all files. If still have this problem, you can send your data via e-mail, I can check your file and test.

Qionghou Li

nhartwic commented 4 years ago

Do I need "ougroup.pep" in my working directory? Your readme makes no mention of it. As far as I can tell gene names are consistent here. I'm using the same methods to make your GFF format files for both my samples (target and outgroup) and am only having issues with outgroup. For reference, this is what my wkdir looks like.

head *
==> targ.blast <==
FUN_000001-T1   FUN_000001-T1   100.000 81      0       0       1       81      1       81      4.99e-55        163
FUN_000001-T1   FUN_019628-T1   55.882  34      15      0       48      81      200     233     0.001   35.0
FUN_000001-T1   FUN_014073-T1   54.839  31      14      0       41      71      131     161     0.014   32.0
FUN_000001-T1   FUN_017902-T1   37.500  56      33      1       20      73      1       56      0.25    28.5
FUN_000001-T1   FUN_017034-T1   42.105  38      19      1       27      61      341     378     0.74    26.9
FUN_000001-T1   FUN_002070-T1   36.538  52      25      2       27      78      3       46      1.0     26.6
FUN_000001-T1   FUN_025458-T1   26.829  41      30      0       37      77      6       46      1.1     26.6
FUN_000001-T1   FUN_010210-T1   35.294  34      22      0       30      63      141     174     1.3     26.6
FUN_000001-T1   FUN_022878-T1   31.707  41      27      1       14      53      46      86      6.0     24.6
FUN_000001-T1   FUN_018664-T1   57.143  14      6       0       29      42      94      107     6.1     24.6

==> targ.gff <==
1       FUN_000001-T1   17883   18126
1       FUN_000002-T1   18594   32866
1       FUN_000003-T1   35452   47763
1       FUN_000004-T1   54987   55406
1       FUN_000005-T1   56761   57102
1       FUN_000006-T1   63338   63799
1       FUN_000007-T1   64107   64859
1       FUN_000008-T1   65039   65311
1       FUN_000009-T1   65640   67157
1       FUN_000010-T1   68667   69137

==> targ_outgroup.blast <==
FUN_000001-T1   Aco008982.1     55.556  18      8       0       29      46      1985    2002    0.093   30.0
FUN_000001-T1   Aco023383.1     30.189  53      37      0       25      77      13      65      0.15    29.3
FUN_000001-T1   Aco013665.1     28.333  60      36      1       2       61      991     1043    1.1     26.9
FUN_000001-T1   Aco003809.1     28.846  52      32      1       6       57      199     245     1.1     26.9
FUN_000001-T1   Aco005999.1     55.000  20      9       0       22      41      69      88      1.3     26.9
FUN_000001-T1   Aco019060.1     41.667  24      14      0       29      52      78      101     1.4     26.6
FUN_000001-T1   Aco000047.1     31.579  57      34      2       5       61      321     372     1.9     26.2
FUN_000001-T1   Aco000473.1     29.167  48      28      1       35      76      186     233     2.3     26.2
FUN_000001-T1   Aco005928.1     45.455  33      12      2       24      51      17      48      2.5     26.2
FUN_000001-T1   Aco018338.1     42.857  28      16      0       24      51      41      68      4.1     25.4

==> targ_outgroup.gff <==
LG01    Aco009737.1     4423    7510
LG01    Aco009736.1     30565   33237
LG01    Aco009735.1     31068   33597
LG01    Aco009734.1     38233   41659
LG01    Aco009733.1     44128   44718
LG01    Aco009732.1     45723   48320
LG01    Aco009731.1     48135   51491
LG01    Aco009730.1     52603   60674
LG01    Aco009729.1     61619   65001
LG01    Aco009728.1     64934   80064
(base) nol@LAPTOP-3MJBSED2:~/BIO/spcyc/dup_gen_dir/wkdir$ grep Aco003809.1 targ_outgroup.gff
LG15    Aco003809.1     3906493 3912332

You can find gzipped versions of the blast and gff files that I've tried at the google drive link below...

https://drive.google.com/drive/folders/1_U2DCIATBwpv_01Cnd2FthS5sKlUq1ob?usp=sharing

....I'm using soft links to match the file patterns your tools requires. My work dir looks like the following. I've tried using both of the outgroup blasts which just vary which sample is used as query and which is used as a database. Both produced the same results....

(base) nol@LAPTOP-3MJBSED2:~/BIO/spcyc/dup_gen_dir/wkdir$ ls -lha
total 1.8M
drwxrwxrwx 1 nol nol 4.0K Sep  8 21:30 .
drwxrwxrwx 1 nol nol 4.0K Sep  8 21:31 ..
lrwxrwxrwx 1 nol nol   27 Sep  7 14:26 targ.blast -> ../sp9504.sp9504.blastp.tsv
-rwxrwxrwx 1 nol nol 849K Sep  7 14:54 targ.gff
lrwxrwxrwx 1 nol nol   24 Sep  7 19:33 targ_outgroup.blast -> ../sp9504.aco.blastp.tsv
-rwxrwxrwx 1 nol nol 891K Sep  7 16:49 targ_outgroup.gff
LQHHHHH commented 4 years ago

No, you just need files which readme described.

I have checked your files and found the file named targ_outgroup.gff.gz only contained genes in outgroup.gff but genes in targ.gff were missing in targ_outgroup.gff.gz.

As we described in readme: [target_species]_[outgroup_species].gff, a gene position file for the target_species and outgroup_species, following a tab-delimited format. That is, the targ_outgroup.gff should contain both genes in target and outgroup species. But in your file, I can only find genes in outgroup.

I have tested your data. When I cat your targ.gff and outgroup.gff to targ_outgroup.gff , I can get normal results. Files:

targ.blast  targ.gff  targ_outgroup.blast  targ_outgroup.gff

This is my stdout:

Reading BLAST file and pre-processing
Generating BLAST list
708823 matches imported (1542503 discarded)
2307 pairwise comparisons
7316 alignments generated
Pairwise collinear blocks written to ./targ.collinearity [22.718 seconds elapsed]
Reading BLAST file and pre-processing
Generating BLAST list
1289163 matches imported (1247861 discarded)
19434 pairwise comparisons
Pairwise collinear blocks written to outdir/targ_outgroup.collinearity 
nhartwic commented 4 years ago

Fair enough. I'm not sure why input needs to be duplicated but thanks for your assistance.