merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
443 stars 145 forks source link

anvi-profile: ValueError: could not broadcast input array from shape (5091) into shape (0) #1418

Closed domenico-simone closed 4 years ago

domenico-simone commented 4 years ago

Hi,

Sorry to come up with another issue!

As a follow-up to issue #1416 I have changed the header of my BAM file(s), sorted them and then tried to run anvi-profile. But, while anvi-profile runs, I get this error multiple times:

[27 Apr 20 09:29:43 Profiling w/20 threads] 312/5489 contigs ⚙  | MEMORY �  1.43 GB (-2.67 GB)                                                                               ETA: 1m58sProcess Process-15:
Traceback (most recent call last):
  File "/crex/proj/uppstore2018116/domenico/conda_envs/anvio-master/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/crex/proj/uppstore2018116/domenico/conda_envs/anvio-master/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/domeni/github/anvio/anvio/profiler.py", line 632, in profile_contig_worker
    contig = self.process_contig(bam_file, contig_name, contig_length)
  File "/home/domeni/github/anvio/anvio/profiler.py", line 672, in process_contig
    self.populate_gene_info_for_splits(contig)
  File "/home/domeni/github/anvio/anvio/profiler.py", line 616, in populate_gene_info_for_splits
    nt_info = self.get_gene_info_for_each_position(contig.name, info=info_of_interest)
  File "/home/domeni/github/anvio/anvio/dbops.py", line 856, in get_gene_info_for_each_position
    data[start:stop, 4] = codon_order_in_gene
ValueError: could not broadcast input array from shape (5091) into shape (0)

then the execution hangs; if I stop it with ctrl+C, I get this message:

* Anvi'o profiler received SIGINT, terminating all processes...

New data for 'layers' in data group 'default'
=====================================
Data key "total_reads_mapped" ......: Predicted type: int
Data key "num_SNVs_reported" .......: Predicted type: int
Data key "total_reads_kept" ........: Predicted type: int

NEW DATA
=====================================
Database ...........................: profile
Data group .........................: default
Data table .........................: layers
New data keys ......................: total_reads_mapped, num_SNVs_reported, total_reads_kept.

* Happy �

✓ anvi-profile took 0:09:46.819013

In case these could be relevant:

Thanks for your help!

Domenico

meren commented 4 years ago

Now this looks like either a big associated with external gene calls or missing exception handling. I'm sure @ekiefl will be happy to see someone has a case to demonstrate it. Can you please send your input files (BAM, BAM.BAI, and CONTIGS.db) to us so we can try to reproduce it?

domenico-simone commented 4 years ago

Hi @meren, thanks! Here's some links:

Contig db: https://bit.ly/2yK4TZg BAM file: https://bit.ly/357Essz BAM index: https://bit.ly/2xhnCLk

To re-create the contig db and the BAM file: Contig fasta: https://bit.ly/3cPA2ck Gene calls imported with anvi-gen-contigs-database: https://bit.ly/2y3expJ FastQ (R1): https://bit.ly/3bKbAJ8 FastQ (R2): https://bit.ly/3eXuw9B

Thank you,

Domenico

meren commented 4 years ago

Thank you very much for all these files!!

ekiefl commented 4 years ago

Thanks @domenico-simone. I can reproduce the error.

ekiefl commented 4 years ago

Hi @domenico-simone. The issue is that you have hundreds of gene calls that have start positions that are negative. Here are 4:

image

What should have happened, is anvi'o should have complained during contigs database creation. I'll try and update the code now.

ekiefl commented 4 years ago

@meren I made a branch with my best attempt, but in that branch anvi-self-test --suite mini fails. Please see for yourself. It seems that prodigal's own gene predictions yields negative values for the start position, therefore I don't know how to proceed.

cbdcba63e7d0314bfe9fef5d625557bb218a4d37

meren commented 4 years ago

I see, @ekiefl. I can see that the external gene calls file contains all those -1s. I will add a sanity check for this to do the testing much earlier.

Meanwhile, @domenico-simone, I think the script that helped you generate those gene calls needs an update :/

Can you remind us how you generated that the external gene calls file?

meren commented 4 years ago

@domenico-simone, if you pull from master anvi'o should now perform a better sanity check on your external gene calls file.

domenico-simone commented 4 years ago

Hi @meren and @ekiefl,

thanks for your prompt reply and test! I have re-run the files I had linked and I see the error (nice error message btw :D). Good to have that sanity check now!

I had used Metaeuk to predict genes on my eukaryotic contigs and I hadn't noticed that the gene starts were already 0-based, so when I converted gene calls for anvi'o I subtracted 1 to the gene start and this caused the -1s to appear.

Thank you,

Domenico