Closed stephlei closed 4 years ago
Hi @stephlei, thank you for reporting this. I will look into this tomorrow and get back to you.
Hi @stephlei, you are using the wrong file as input for get-gcs-of-core-functions.py. Please follow all the steps here, without skipping any:
http://merenlab.org/2016/11/08/pangenomics-v2/#creating-a-quick-pangenome-with-functions
Specifically, notice that when you run anvi-get-enriched-functions-per-pan-group you need to specify (using --functional-occurrence-table-output) that you also want the table of function occurrence frequencies. So in our example we have two outputs:
PROCHLORO-PAN-enriched-functions-light.txt
PROCHLORO-functions-occurrence-frequency.txt
While it seems that you used output number 1, in this section of the tutorial we need the second output.
I hope this is helpful. If things don't work for you, then please include in your next report all the commands that you used so I can see exactly how you are using these ad-hoc scripts.
So I ran through the tutorial again. I started by using file number one because that is what is listed in that step of the tutorial.
Error from using File 1:
So I used your suggestion and used File 2:
I have redone this section using several different methods, but the results are generally the same. When I use anvi'o's virtual environment I do get this error when generating the collection, though I'm not sure if it is meaningful.
Thank you for your help.
Sorry about this @stephlei . It looks like this is probably my fault. I’ll try to take a look at this later this week. And apologies for my misleading message earlier!
Would you be willing to share the files you use as input for this script via my email?
No problem. I have sent the files over to you. Thanks for the help.
Hi @stephlei , thank you for sending it. I got it. I want to apologize that it might take me a little bit of time to test this since I am working on something with a tight deadline. I will update you as soon as I get to look into this.
@stephlei , I wanted to update you that I found the source of the issue. But unfortunately I don't currently have the time to create a solution.
The source of the issue is that you have two functions in your enrichment output that have the same name: Chemotaxis protein CheY.
The bottom line is that this duplication is messing up the ad-hoc script get-gcs-of-core-functions.py.
I apologize that I have no solution for you as of now.
I think I have now figured out why this is happening to begin with. It happens when there are two functions that have the same name but different accession numbers.
Before v6, we used the function name when we did the statistical analysis, and that guaranteed that the output would have a list of unique names. Now we use the accession number, so there is no longer such a guarantee. I think it is not so desirable that the first column of the output of anvi-get-enriched-functions-per-pan-group does not necessarily contain unique values. What do you think @meren? Should we address that? Or is it OK, since the output also includes the accessions in the second column?
By the way, notice that I edited my comment above. What was there before was rubbish...
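The duplication described above (one function name appearing with more than one accession) can be checked for before running the ad-hoc script. The following is only a minimal sketch, not part of anvi'o: it assumes a tab-delimited file with a header row, the function name in the first column and the accession in the second, as described in this thread. The helper name find_duplicate_names, the column headers, and the accession values in the sample are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; real headers and accessions will differ.
SAMPLE = (
    "function\taccession\n"
    "Chemotaxis protein CheY\tACC_0001\n"
    "Chemotaxis protein CheY\tACC_0002\n"
    "DNA gyrase subunit A\tACC_0003\n"
)


def find_duplicate_names(handle):
    """Return {function name: sorted accessions} for names with more than one accession."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    accessions_by_name = defaultdict(set)
    for row in reader:
        # Assumption: first column is the function name, second is the accession.
        name, accession = row[0], row[1]
        accessions_by_name[name].add(accession)
    return {
        name: sorted(accessions)
        for name, accessions in accessions_by_name.items()
        if len(accessions) > 1
    }


duplicates = find_duplicate_names(io.StringIO(SAMPLE))
print(duplicates)
```

To check a real enrichment output, open the file and pass the handle instead of the in-memory sample, for example `with open(path, newline="") as handle: find_duplicate_names(handle)`.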
Can we combine function name and accession id to create a 'unique' key for everything we have?
Otherwise we can't fix this problem without coming up with a very hacky design :/
> Can we combine function name and accession id to create a 'unique' key for everything we have?
That’s also what I had in mind. I’ll implement this solution soon(ish (-: ).
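The idea of combining function name and accession into one key could look roughly like the sketch below. This is not the implementation that went into anvi'o; the helper make_unique_key, the "|" separator, and the accession values are hypothetical. The key is only unique as long as each accession occurs once in the table.

```python
def make_unique_key(name, accession, sep="|"):
    """Join function name and accession into a single key (hypothetical helper)."""
    return f"{name}{sep}{accession}"


# Made-up (function name, accession) pairs; the two CheY rows share a name.
rows = [
    ("Chemotaxis protein CheY", "ACC_0001"),
    ("Chemotaxis protein CheY", "ACC_0002"),
    ("DNA gyrase subunit A", "ACC_0003"),
]

names = [name for name, _ in rows]
keys = [make_unique_key(name, accession) for name, accession in rows]

# Names alone collide, while the combined keys do not.
print(len(set(names)), len(set(keys)))
```

A separator that cannot appear in either field keeps the key unambiguous and makes it easy to split back into name and accession later.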
I've run into a similar issue when running through the pangenomics workflow and the get-gcs-of-core-functions.py script. Is there a fix that uses unique function ids?
Thanks!
Which version of anvi'o are you using, @laj873?
I've been using version 5.5
You should try v6.2 :)
So far I have been using anvi'o on an iMac running Mojave version 10.14.5, but I installed anvi'o using Docker. Below is the output of anvi-self-test --version.
Thus far I have been following the pangenomics tutorial. When I get close to the last steps, which involve get-gcs-of-core-functions.py, I keep getting this error
I saw an earlier issue that prompted me to install pandas version 0.23.1, but this did not fix the problem. The data that the script is trying to split comes from the anvi-get-enriched-functions-per-pan-group output. I saw on the blog that this is one of the programs that was updated, so I'm unsure if I'm using anvi'o wrong or if this stems from the update. This is a link to the enrichment data that was produced by anvi-get-enriched-functions-per-pan-group.