db_getClusterGeneInformation.py uses wrong columns

JamesRH commented 11 years ago

Around line 71 db_getClusterGeneInformation.py calls:

cmd = "cat %s | db_getClusterGeneInformation.py -r %d -c %d | annoteSeq2Fasta.py -g 3 -a 5 -s 6 > %s" %(fname, rc+1, cc+1, fasta)

These column options should either be removed (the default is, I think, right) or changed to -g 1 -a 5 -s 12.

Bug reproduction:

jamesrh@biome:~/iTEP_Matt/pipeline_treetables_orthocluster_4$ echo "all_I_1.7_c_0.1_m_maxbit805" | db_makeClusterAlignment.py   -m  mafft_default --notrim

Error returned: 
cat 786782332.tmp | db_getClusterGeneInformation.py -r 1 -c 2 | annoteSeq2Fasta.py -g 3 -a 5 -s 6 > 786782332.fasta
Traceback (most recent call last):
  File "/data/Cluster_Files/src/db_getClusterGeneInformation.py", line 52, in <module>
    con.execute(query, (spl[rc], spl[cc]))
IndexError: list index out of range
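For illustration, here is a minimal sketch of how the column lookup could fail with a clearer message. It assumes the script splits each input line on tabs, as the `spl[rc]` and `spl[cc]` names in the traceback suggest. The function name `parse_cluster_row` is hypothetical and is not taken from the actual source.

```python
def parse_cluster_row(line, rc=0, cc=1):
    """Split a tab-delimited line and return (run ID, cluster ID).

    rc and cc are zero-based column indexes, mirroring the names in the
    traceback. This helper is hypothetical, not the script's real code.
    """
    spl = line.rstrip("\r\n").split("\t")
    needed = max(rc, cc) + 1
    if len(spl) < needed:
        # A line with no tab has a single field, so spl[cc] would raise
        # IndexError; report the likely cause instead.
        raise ValueError(
            "expected at least %d tab-delimited columns, got %d: %r"
            % (needed, len(spl), line)
        )
    return spl[rc], spl[cc]
```

With the input from the reproduction (run ID and cluster ID fused into one field), this raises a ValueError naming the column count rather than a bare IndexError.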

In the past I made these changes to db_makeClusterAlignment.py. As we discussed, I was making it multiplex many cluster runs as well as fixing this bug; the multiplexing is not necessary.

< cmd = "cat %s | db_getClusterGeneInformation.py -r %d -c %d | annoteSeq2Fasta.py -g 3 -a 5 -s 6 > %s" %(fname, rc+1, cc+1, fasta)

---
> cmd = "cat %s | db_getClusterGeneInformation.py -r %d -c %d | cut -f 1 |sort -u | db_getGeneInformation.py | annoteSeq2Fasta.py -g 1 -a 5 -s 12 > %s" %(fname, rc+1, cc+1, fasta)

mattb112885 commented 11 years ago

I fixed part of this problem (the incorrect column numbers). What did we decide: should we limit this to just one cluster/run pair?

mattb112885 commented 11 years ago

FYI - the exact error you got came from how you called this function: the cluster and run ID should be in separate columns on input. (There was also a separate problem with column numbers that created an invalid fasta file.) So I recommend doing this:

makeTabDelimitedRow.py "all_I_1.7_c_0.1_m_maxbit" 805 | db_makeClusterAlignment.py -m mafft_default --notrim

Things like this are why I created the makeTabDelimitedRow.py function.
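As a rough idea of what such a helper does, here is a sketch that joins its arguments with tabs. This is an assumption about the behavior implied by the command above, not the actual source of makeTabDelimitedRow.py, and the function name `make_tab_delimited_row` is hypothetical.

```python
import sys


def make_tab_delimited_row(fields):
    """Join the given fields into a single tab-delimited line (no newline)."""
    return "\t".join(str(f) for f in fields)


if __name__ == "__main__":
    # Each command-line argument becomes one column of the output row.
    print(make_tab_delimited_row(sys.argv[1:]))
```

Called with "all_I_1.7_c_0.1_m_maxbit" and 805 as arguments, it would emit the run ID and cluster ID as two separate columns, which is the input shape described above.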

JamesRH commented 11 years ago

Yes, I agree that your original design follows the principle of least surprise. One cluster/run pair is fine.

Perhaps it makes more sense as a command-line arg. I think it makes sense to take multi-line input from a pipe and single variables as arguments; they can always be multiplexed in a wrapper script or one of my xargs monstrosities.

James H.

JamesRH commented 11 years ago

I cannot tell you how many times I needed to do that today.

Now I know the function exists.

I really need to set aside an hour and read the documentation.

Here are two other ugly hacks. Do you know a better shell or do you have a program to do this?

1) I do this sort of thing when I need to reverse the columns (cut -f 3,1 gives the same output as cut -f 1,3, but I want them in the other order):

cat all* | cut -f 1 > tmp1; cat all* | cut -f 8 > tmp2; paste tmp2 tmp1 > out; rm tmp2 tmp1

Is there a better way?

2) Adding up one (or more) columns:

cat in|cut -f 1| paste -sd+ |bc

Have a good weekend, James H.
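One possible alternative to the two shell one-liners above, sketched in Python to match the rest of the pipeline. It assumes tab-delimited input; the function names `reorder_columns` and `sum_column` are hypothetical and do not refer to existing scripts in the repository.

```python
def reorder_columns(lines, columns):
    """Yield each tab-delimited line reduced to the given 1-based columns,
    in exactly the order requested (unlike cut, which keeps file order)."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        yield "\t".join(fields[c - 1] for c in columns)


def sum_column(lines, column=1):
    """Return the sum of a 1-based numeric column, skipping blank lines.

    Values are parsed as floats, so the result is always a float.
    """
    total = 0.0
    for line in lines:
        if not line.strip():
            continue
        total += float(line.rstrip("\r\n").split("\t")[column - 1])
    return total


# Example use on standard input (uncomment one of these):
#   import sys
#   for row in reorder_columns(sys.stdin, [8, 1]):
#       print(row)
#   print(sum_column(sys.stdin, column=1))
```

This avoids the temporary files in the first hack. In plain shell, awk can also print fields in any order, for example awk -F'\t' -v OFS='\t' '{print $8, $1}'.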

On 11/30/2012 05:25 PM, mattb112885 wrote:

FYI - you called this function weirdly which was what caused the exact error you obtained (there was also a separate problem with column numbers that created an invalid fasta file) - the cluster and run ID should be in separate columns on input. So I recommend doing this:

makeTabDelimitedRow.py "all_I_1.7_c_0.1_m_maxbit" 805 | db_makeClusterAlignment.py -m mafft_default --notrim

Things like this are why I created the makeTabDelimitedRow.py function.

— Reply to this email directly or view it on GitHub https://github.com/mattb112885/clusterDbAnalysis/issues/23#issuecomment-10908263.

JamesRH commented 11 years ago

Sorry, my bug report sucked. I was trying to simplify it and ended up submitting it that way. In my code I was piping from a string that did pass the two values separated by a tab.

Sorry, James H

mattb112885 commented 11 years ago

OK - I limited it to one for now but I agree it would make more sense as a command line arg (or with both options). I'm going to close this and file a new bug to that effect.
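A minimal sketch of how "both options" might look: take the run ID and cluster ID from command-line arguments when given, and otherwise read one tab-delimited row from standard input. The flag names `--runid` and `--clusterid` and the function `get_run_and_cluster` are assumptions made for illustration, not the script's actual interface.

```python
import argparse


def get_run_and_cluster(argv, stdin):
    """Return (run ID, cluster ID) from arguments or from one piped row.

    argv is the argument list without the program name; stdin is any
    file-like object. Flag names here are hypothetical.
    """
    parser = argparse.ArgumentParser(
        description="Sketch: accept one run/cluster pair as args or via a pipe."
    )
    parser.add_argument("--runid", help="run ID (hypothetical flag)")
    parser.add_argument("--clusterid", help="cluster ID (hypothetical flag)")
    args = parser.parse_args(argv)

    if args.runid is not None and args.clusterid is not None:
        return args.runid, args.clusterid
    if args.runid is not None or args.clusterid is not None:
        parser.error("--runid and --clusterid must be given together")

    # Fall back to a single tab-delimited row: run ID, then cluster ID.
    fields = stdin.readline().rstrip("\r\n").split("\t")
    if len(fields) < 2:
        parser.error("expected a run ID and a cluster ID separated by a tab")
    return fields[0], fields[1]
```

A script would call it as get_run_and_cluster(sys.argv[1:], sys.stdin). Reading only one row from the pipe keeps the one-pair limit, while a wrapper can still multiplex by invoking the script once per pair.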

mattb112885 / clusterDbAnalysis

db_getClusterGeneInformation.py uses wrong columns #23