tanghaibao / jcvi

Python library to facilitate genome assembly, annotation, and comparative genomics
BSD 2-Clause "Simplified" License
760 stars 186 forks source link

WARNING scaffold97size618721 not in aiptasia_genome.bed but scaffold97size618721 exists in bed #538

Open Wongolini opened 1 year ago

Wongolini commented 1 year ago

Thanks for your help earlier - I just did not properly read the error out. The example works now, so I moved on to my own genomes. My own genomes are terrible, with very high N50s and lots of putative isoforms in genes. Could this have something to do with the failure? I ran your suggested --primary_only when coverting the gffs to bed files. Have you ever seen this error before?

(mcscan) [~/Documents/Cleves_lab/mcscan $]python3 -m jcvi.compara.catalog ortholog aiptasia_genome gfas --no_strip_names
[11:10:47] DEBUG    File `aiptasia_genome.gfas.last` found. Computation skipped.                                                                                                                      base.py:1353
           DEBUG    Assuming --qbed=aiptasia_genome.bed --sbed=gfas.bed                                                                                                                             synteny.py:496
           DEBUG    Load file `aiptasia_genome.bed`                                                                                                                                                     base.py:34
           DEBUG    Load file `gfas.bed`                                                                                                                                                                base.py:34
           DEBUG    Load BLAST file `aiptasia_genome.gfas.last` (total 3718196 lines)                                                                                                            blastfilter.py:46
           DEBUG    Load file `aiptasia_genome.gfas.last`                                                                                                                                               base.py:34
[11:10:56] WARNING  scaffold97size618721 not in aiptasia_genome.bed  
WARNING  scaffold842size31331 not in aiptasia_genome.bed                                                                                                                              
WARNING  scaffold110size576893 not in aiptasia_genome.bed                                                                                                                             
WARNING  scaffold110size576893 not in aiptasia_genome.bed                                                                                                                             
WARNING  scaffold691size61117 not in aiptasia_genome.bed                                                                                                                              
WARNING  scaffold165size444580 not in aiptasia_genome.bed                                                                                                                             
WARNING  scaffold728size46749 not in aiptasia_genome.bed                                                                                                                              
WARNING  scaffold205size394943 not in aiptasia_genome.bed                                                                                                                             
WARNING  scaffold12size1245834 not in aiptasia_genome.bed                                                                                                                             
WARNING  scaffold842size31331 not in aiptasia_genome.bed     
.
.
.

but

(mcscan) [~/Documents/Cleves_lab/mcscan $]grep scaffold97size618721 aiptasia_genome.bed | head
scaffold97size618721    289 4606    AIPGENE19630    0   -
scaffold97size618721    8148    17945   AIPGENE19596    0   -
scaffold97size618721    17805   19656   AIPGENE19625    0   -
scaffold97size618721    19795   32502   AIPGENE19635    0   +
scaffold97size618721    19795   32502   AIPGENE19636    0   +
scaffold97size618721    33714   37726   AIPGENE19607    0   -
scaffold97size618721    40408   43226   AIPGENE19609    0   -
scaffold97size618721    47725   49382   AIPGENE19613    0   +
scaffold97size618721    50176   63086   AIPGENE19617    0   -
scaffold97size618721    88245   94891   AIPGENE19634    0   +
tanghaibao commented 1 year ago

@Wongolini

Would you please check if scaffold97size618721 somehow shows up in the CDS and last output? Usually this complaint is about a gene (4th column) not found in the BED file, and not about the scaffold.

Wongolini commented 1 year ago

If this is what you are asking, then I believe the scaffold exists in the .cds file.

`(mcscan) [~/Documents/Cleves_lab/mcscan $]grep scaffold97size618721 aiptasia_genome.cds
>scaffold97size618721`

and in last:

(mcscan) [~/Documents/Cleves_lab/mcscan $]grep scaffold97size618721 aiptasia_genome.gfas.last | head
scaffold97size618721    xfSc0000357 60.10   9557    3764    13  157623  148087  65759   75286   0   3.01e+03
scaffold97size618721    xfSc0000918 59.44   2774    1116    3   124273  127040  26168   28938   1.9e-227    802
scaffold97size618721    Sc0000468   58.41   3056    1262    3   124258  127307  94724   97776   1.1e-222    786
scaffold97size618721    xfSc0000357 60.57   1291    503 1   159664  158380  63697   64987   6.4e-119    441
scaffold97size618721    xfSc0000143 57.54   1797    760 1   157220  155427  125951  127747  5.1e-117    435
scaffold97size618721    xfSc0000143 58.54   1611    656 3   155354  153747  127823  129424  2.6e-111    416
scaffold97size618721    xfSc0000143 59.47   1278    515 1   153635  152358  129542  130816  2.9e-102    386
scaffold97size618721    xfSc0000143 59.31   1214    485 2   159664  158457  123501  124711  3.6e-94 359
scaffold97size618721    xfSc0000357 59.22   1160    464 1   167009  165850  54630   55780   6.2e-89 342
scaffold97size618721    xfSc0000143 58.35   1215    506 0   149518  148304  133650  134864  4e-85   329

But it is not in last.filtered

tanghaibao commented 1 year ago

@Wongolini

Yes that scaffold sequence in the .cds file is what generated the warning. The program did not expect to find the scaffold sequences there, because it expects only gene sequences (precisely, anything present in the 4th column in the .bed file). ".cds" means coding sequences.

If you remove the scaffold sequence from the ".cds", the warning will be gone, or just ignore the warning. Since the program can't make use of it either way.

Wongolini commented 1 year ago

I see thank you. So these warnings should not necessarily be a reason why the program fails to find any anchors? I fear that my genome assemblies are too poor to have any synteny analysis done on them with MCscan - could that be the case? However, I know that there are syntenic blocks in these genomes because I have found them manually. It appears to fail even before applying cscore so I am not sure what else to tweak. Thanks for all your help!

           WARNING  too many warnings.. suppressed                                                                                                                                                                                                                      blastfilter.py:67
[11:02:33] DEBUG    running the cscore filter (cscore>=0.70) ..                                                                                                                                                                                                        blastfilter.py:108
           DEBUG    after filter (0->0) ..                                                                                                                                                                                                                             blastfilter.py:110
           DEBUG    running the local dups filter (tandem_Nmax=10) ..                                                                                                                                                                                                  blastfilter.py:115
           DEBUG    after filter (0->0) ..                                                                                                                                                                                                                             blastfilter.py:155
           DEBUG    Assuming --qbed=aiptasia_genome.bed --sbed=gfas.bed                                                                                                                                                                                                    synteny.py:496
           DEBUG    Load file `aiptasia_genome.bed`                                                                                                                                                                                                                            base.py:34
           DEBUG    Load file `gfas.bed`                                                                                                                                                                                                                                       base.py:34
           DEBUG    Load file `aiptasia_genome.gfas.last.filtered`                                                                                                                                                                                                             base.py:34
           DEBUG    A total of 0 BLAST imported from `aiptasia_genome.gfas.last.filtered`.                                                                                                                                                                                 synteny.py:383
           DEBUG    Chaining distance = 20                                                                                                                                                                                                                                synteny.py:1928
           DEBUG    Load file `aiptasia_genome.gfas.anchors`                                                                                                                                                                                                                   base.py:34
           DEBUG    A total of 0 anchor was found. Aborted.
tanghaibao commented 1 year ago

@Wongolini

Please make sure that you are using genes, not scaffold sequences in aiptasia_genome.cds. It appears that you are directly comparing scaffold sequences, which is unfortunately not what jcvi is designed to do. Consequently there are 0 anchors after filtering as shown in the logs.

Try to use gene sequences directly, e.g. AIPGENE19630 in the .cds file instead of the scaffold sequences. Then you'll have gene matches in the .last files, which jcvi synteny algorithm can operate on.