merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

Issue with anvi-get-enriched-functions-per-pan-group output #1283

Closed stephlei closed 4 years ago

stephlei commented 4 years ago

So far I have been using Anvi'o on an iMac running macOS Mojave (version 10.14.5), with Anvi'o installed through Docker. Below is the output of anvi-self-test --version.

Anvi'o version ...............................: esther (v6-master)
Profile DB version ...........................: 31
Contigs DB version ...........................: 14
Pan DB version ...............................: 13
Genome data storage version ..................: 6
Auxiliary data storage version ...............: 2
Structure DB version .........................: 1

Thus far I have been following the Pangenomic Tutorial. When I reach the last steps, which involve get-gcs-of-core-functions.py, I keep getting this error:

Traceback (most recent call last):
  File "get-gcs-of-core-functions.py", line 30, in <module>
    main(args)
  File "get-gcs-of-core-functions.py", line 13, in main
    gcs_of_core_functions.extend(data.loc[ori_func.strip(),'gene_clusters_ids'].split(', '))
  File "/opt/conda/envs/anvioenv/lib/python3.6/site-packages/pandas/core/generic.py", line 4372, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'split'

I saw an earlier issue that prompted me to install pandas version 0.23.1, but this did not fix the problem. The data that the script tries to split comes from the anvi-get-enriched-functions-per-pan-group output. I saw on the blog that this is one of the programs that was updated, so I'm unsure whether I'm using Anvi'o wrong or whether this stems from the update. This is a link to the enrichment data that was produced by anvi-get-enriched-functions-per-pan-group.

ShaiberAlon commented 4 years ago

Hi @stephlei, thank you for reporting this. I will look into this tomorrow and get back to you.

ShaiberAlon commented 4 years ago

Hi @stephlei, you are using the wrong file as input for get-gcs-of-core-functions.py. Please follow all the steps here without skipping any: http://merenlab.org/2016/11/08/pangenomics-v2/#creating-a-quick-pangenome-with-functions

Specifically, notice that when you run anvi-get-enriched-functions-per-pan-group you need to specify (using --functional-occurrence-table-output) that you also want the functions frequency of occurrence table. So in our example we have two outputs:

  1. PROCHLORO-PAN-enriched-functions-light.txt
  2. PROCHLORO-functions-occurrence-frequency.txt

While it seems that you used output number 1, in this section of the tutorial we need the second output.

I hope this is helpful. If things don't work for you, then please include in your next report all the commands that you used so I can see exactly how you are using these ad-hoc scripts.

stephlei commented 4 years ago

So I ran through the tutorial again. I started by using file number 1 because that is what is listed in that step of the tutorial.

[Image: tutorial]

Error from using File 1:

[Image: error]

So I used your suggestion and used File 2:

[Image: differentError]

I have redone this section using several different methods, but the results are generally the same. When I use Anvi'o's virtual environment, I do get this error when generating the collection, though I'm not sure whether it is meaningful.

[Image: idk]

Thank you for your help.

ShaiberAlon commented 4 years ago

Sorry about this, @stephlei. It looks like this is probably my fault. I’ll try to take a look at this later this week. And apologies for my misleading message earlier!

Would you be willing to share the files you use as input for this script via my email?

stephlei commented 4 years ago

No problem. I have sent the files over to you. Thanks for the help.

ShaiberAlon commented 4 years ago

Hi @stephlei, thank you for sending the files; I received them. I apologize that it might take me a little while to test this, since I am working on something with a tight deadline. I will update you as soon as I get to look into this.

ShaiberAlon commented 4 years ago

@stephlei, I wanted to update you that I found the source of the issue, but unfortunately I don't currently have the time to create a solution.

The source of the issue is that you have two functions in your enrichment output that have the same name: Chemotaxis protein CheY.

The bottom line is that this duplication is messing up the ad-hoc script get-gcs-of-core-functions.py.
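
The mechanism can be reproduced in isolation. The following is a minimal sketch, assuming the script indexes the table by function name as the traceback suggests; only the column name `gene_clusters_ids` comes from the traceback, and the gene cluster ids and the second function name are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the table the script loads, indexed by function name.
# "Chemotaxis protein CheY" appears twice, as in the reported enrichment output.
data = pd.DataFrame(
    {"gene_clusters_ids": ["GC_0001, GC_0002", "GC_0003", "GC_0004, GC_0005"]},
    index=["Chemotaxis protein CheY", "Chemotaxis protein CheY", "DNA gyrase"],
)

# A label that occurs once yields a single string, which has .split().
unique_hit = data.loc["DNA gyrase", "gene_clusters_ids"]

# A label that occurs twice yields a pandas Series (one entry per matching row).
duplicated_hit = data.loc["Chemotaxis protein CheY", "gene_clusters_ids"]

print(type(unique_hit).__name__)
print(type(duplicated_hit).__name__)

try:
    duplicated_hit.split(", ")
    error_message = None
except AttributeError as exc:
    # Same failure mode as in the traceback above.
    error_message = str(exc)

print(error_message)
```

With a unique label the lookup returns a string; with a repeated label it returns a Series, and calling `.split` on that Series raises the `AttributeError` shown in the original report.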

I apologize that I have no solution for you as of now.

ShaiberAlon commented 4 years ago

I think I have now figured out why this is happening to begin with. It happens when there are two functions that have the same name but different accession numbers.

Before v6, we used the function name when we did the statistical analysis, and that guaranteed that the output would have a list of unique names. Now we use the accession number, and hence there is no longer such a guarantee. I think it is not desirable that the first column of the output of anvi-get-enriched-functions-per-pan-group does not necessarily contain unique values. What do you think, @meren? Should we address that? Or is it OK, since the output also includes the accessions in the second column?
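
Whether a given output is affected can be checked before running the ad-hoc script. This is a minimal sketch, assuming the output is a tab-delimited text file with one header row and the function name in the first column; the header names, accessions, and the helper name `duplicated_first_column_values` are hypothetical.

```python
import csv
import io
from collections import Counter


def duplicated_first_column_values(handle):
    """Return first-column values that occur more than once in a tab-delimited table."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    counts = Counter(row[0] for row in reader if row)
    return sorted(name for name, count in counts.items() if count > 1)


# Made-up rows standing in for an enrichment output (name first, accession second).
example = io.StringIO(
    "function\taccession\n"
    "Chemotaxis protein CheY\tACC_A\n"
    "Chemotaxis protein CheY\tACC_B\n"
    "DNA gyrase\tACC_C\n"
)

duplicates = duplicated_first_column_values(example)
print(duplicates)
```

For a real file, the same helper could be called on `open(path, newline="")` in place of the in-memory example.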

By the way, notice that I edited my comment above. What was there before was rubbish...

meren commented 4 years ago

Can we combine function name and accession id to create a 'unique' key for everything we have?

Otherwise we can't fix this problem without coming up with a very hacky design :/
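
As an illustration of the idea only, and not the implementation that ended up in anvi'o, a combined key could look like the sketch below. The separator, the helper name `make_unique_key`, and all row values are assumptions made for the example.

```python
def make_unique_key(function_name, accession, separator="|"):
    """Join function name and accession so rows sharing a name stay distinct."""
    return f"{function_name}{separator}{accession}"


# Made-up rows: (function name, accession, gene cluster ids).
rows = [
    ("Chemotaxis protein CheY", "ACC_A", "GC_0001, GC_0002"),
    ("Chemotaxis protein CheY", "ACC_B", "GC_0003"),
    ("DNA gyrase", "ACC_C", "GC_0004, GC_0005"),
]

clusters_by_key = {}
for function_name, accession, cluster_ids in rows:
    key = make_unique_key(function_name, accession)
    if key in clusters_by_key:
        # The combined key is only unique if each (name, accession) pair is unique.
        raise ValueError(f"Duplicate key: {key}")
    clusters_by_key[key] = cluster_ids.split(", ")

print(sorted(clusters_by_key))
```

Both CheY entries survive under separate keys, so a lookup by key always returns one list of gene cluster ids.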

ShaiberAlon commented 4 years ago

> Can we combine function name and accession id to create a 'unique' key for everything we have?

That’s also what I had in mind. I’ll implement this solution soon(ish (-: ).

ghost commented 4 years ago

I've run into a similar issue when running through the pangenomics workflow and the get-gcs-of-core-functions.py script. Is there a fix that uses unique function ids?

Thanks!

meren commented 4 years ago

Which version of anvi'o are you using, @laj873?

ghost commented 4 years ago

I've been using version 5.5.

meren commented 4 years ago

You should try v6.2 :)