mgalardini / pyseer

SEER, reimplemented in python 🐍🔮
http://pyseer.readthedocs.io
Apache License 2.0

ValueError: Unexpected mismatch between header and data row #55

Closed SumeetTiwari07 closed 6 years ago

SumeetTiwari07 commented 6 years ago

Hi, I am trying to run pyseer on a total of 1199 isolates. The command line is as follows:

pyseer --phenotypes host.pheno --pres gene_presence_absence.Rtab --no-distances --cpu 20 >host_association.txt

But I encounter the same error every time. I have cross-checked the isolate names and their number in both input files but didn't find any difference.

The first few lines are a warning, then the run starts, and then the error arrives:

/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
Read 1199 phenotypes
Detected binary phenotype
Traceback (most recent call last):
  File "/bin/pyseer", line 11, in <module>
    sys.exit(main())
  File "/lib/python3.6/site-packages/pyseer/main.py", line 450, in main
    options.cpu*options.block_size))
  File "/lib/python3.6/multiprocessing/pool.py", line 274, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/lib/python3.6/multiprocessing/pool.py", line 376, in _map_async
    iterable = list(iterable)
  File "/lib/python3.6/site-packages/pyseer/input.py", line 471, in iter_variants
    sample_order)
  File "/lib/python3.6/site-packages/pyseer/input.py", line 323, in read_variant
    raise ValueError('Unexpected mismatch between header and data row')
ValueError: Unexpected mismatch between header and data row

johnlees commented 6 years ago

Two quick questions in case this is something we might have fixed recently: Are your isolate names integers or strings? What version of pyseer are you running?

SumeetTiwari07 commented 6 years ago

1) My isolate names are mostly alphanumeric. A few contain a hyphen (-) or underscore (_), and a few are just numbers, with or without a hyphen or underscore. Earlier they also contained "#", but I thought some tools wouldn't handle # properly, so I replaced it with a hyphen.
2) The version is 1.1.1.

johnlees commented 6 years ago

We had a problem (fixed in fa376ae8a8347e7e68cea1f678207d53083a2c25) with treating sample names as integers if they are just a number, which sounds like it applies to your issue as you have a few which are just numbers.

This isn't in v1.1.1 and will be in a future release, so there are two options:

  1. Run git clone git@github.com:mgalardini/pyseer.git to get the latest version, and run pyseer with python pyseer-runner.py rather than pyseer.
  2. Use a workaround - add a letter before the sample names that are numeric only
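The second option (the workaround) could be scripted roughly as follows. This is a minimal sketch, not part of pyseer: the helper names `prefix_numeric`, `fix_pheno_lines` and `fix_rtab_header` are hypothetical, the prefix letter "s" is arbitrary, and it assumes that both files are tab-separated, that the phenotype file has a header line with sample names in the first column, and that the .Rtab header lists the sample names after its first field. "Numeric only" is taken here to mean names made purely of digits.

```python
def prefix_numeric(name, prefix="s"):
    """Return the name with a letter in front if it consists only of digits."""
    return prefix + name if name.isdigit() else name


def fix_pheno_lines(lines, prefix="s"):
    """Prefix numeric-only sample names in the first tab-separated column.

    The first line is treated as a header and left unchanged (assumption).
    """
    if not lines:
        return []
    fixed = [lines[0]]
    for line in lines[1:]:
        fields = line.rstrip("\n").split("\t")
        fields[0] = prefix_numeric(fields[0], prefix)
        fixed.append("\t".join(fields) + "\n")
    return fixed


def fix_rtab_header(header, prefix="s"):
    """Prefix numeric-only sample names in an .Rtab header line."""
    fields = header.rstrip("\n").split("\t")
    # fields[0] labels the gene column; the remaining fields are sample names.
    fields[1:] = [prefix_numeric(field, prefix) for field in fields[1:]]
    return "\t".join(fields) + "\n"


if __name__ == "__main__":
    # Made-up example data, not taken from the files in this thread.
    pheno = ["samples\thost\n", "1234\t1\n", "A-12\t0\n", "56_7\t1\n"]
    print("".join(fix_pheno_lines(pheno)), end="")
    print(fix_rtab_header("Gene\t1234\tA-12\t56_7\n"), end="")
```

The same prefix has to be applied to both files, otherwise the sample names no longer match between the phenotypes and the presence/absence table.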

Apologies for this issue. Let us know if this works or if you are still having problems.

@mgalardini I think this is the third or fourth user to run into this issue - I think we should consider a new release on pypi and conda which includes the recent fixes?

mgalardini commented 6 years ago

Yeah, that sounds like a good idea. I'll do a release today if nothing else needs to be added urgently...

SumeetTiwari07 commented 6 years ago

Even the conda version has this issue; I tried both. But thanks a lot, I will try what you suggested and let you know if it works for me.

mgalardini commented 6 years ago

Hi, the conda version is the same version that you are using, so that is an expected behaviour. I just pushed a new release to pypi, so you could try pip install --upgrade pyseer to get the latest version that includes the fix.

SumeetTiwari07 commented 6 years ago

Hi, I have upgraded pyseer to 1.1.2, but unfortunately that doesn't work either; I get the same error.

johnlees commented 6 years ago

Ah, sorry about that. Could you send your host.pheno file and the first few lines of the .Rtab file?

SumeetTiwari07 commented 6 years ago

Hi, please find the attachments. host.pheno is renamed as host_test_pheno.txt, and the other file is the first few lines of the Roary gene presence/absence .Rtab output.

Attachments: gene_presence_absence_test.txt, host_test_pheno.txt

johnlees commented 6 years ago

I can replicate on v1.1.2 as well. I think this is probably the int/str bug, but not caught due to the use of --no-distances. I'll try and have a full look and fix it tomorrow

mgalardini commented 6 years ago

Hi,

using the following gene presence absence file (gene_presence_absence_test1.txt) I don't get any error but I get the following output:

Read 1199 phenotypes
Detected binary phenotype
variant af      filter-pvalue   lrt-pvalue      beta    beta-std-err    intercept       notes
test0   5.05E-01        4.11E-01        4.11E-01        1.19E-01        1.45E-01        -1.45E+00
test1   5.07E-01        6.97E-01        6.97E-01        -5.63E-02       1.44E-01        -1.36E+00
test2   5.19E-01        2.19E-01        2.19E-01        -1.77E-01       1.44E-01        -1.30E+00
test3   5.17E-01        5.74E-01        5.73E-01        8.14E-02        1.45E-01        -1.43E+00
test4   4.81E-01        9.43E-01        9.43E-01        -1.04E-02       1.44E-01        -1.38E+00
test5   4.86E-01        4.98E-01        4.97E-01        -9.81E-02       1.45E-01        -1.34E+00
test6   4.94E-01        3.48E-01        3.48E-01        -1.36E-01       1.45E-01        -1.32E+00
test7   5.33E-01        9.89E-01        9.89E-01        1.96E-03        1.45E-01        -1.39E+00
test8   5.07E-01        2.34E-02        2.33E-02        -3.28E-01       1.45E-01        -1.23E+00
test9   4.85E-01        7.11E-02        7.07E-02        -2.62E-01       1.45E-01        -1.26E+00
[...]
test90  4.85E-01        9.66E-01        9.66E-01        -6.19E-03       1.44E-01        -1.38E+00
test91  4.64E-01        5.92E-01        5.92E-01        7.75E-02        1.45E-01        -1.42E+00
test92  5.23E-01        8.29E-01        8.29E-01        3.12E-02        1.45E-01        -1.40E+00
test93  4.74E-01        2.91E-01        2.91E-01        1.52E-01        1.44E-01        -1.46E+00
test94  5.09E-01        7.84E-01        7.84E-01        3.96E-02        1.44E-01        -1.41E+00
test95  4.80E-01        4.62E-01        4.61E-01        -1.07E-01       1.45E-01        -1.34E+00
test96  5.05E-01        5.93E-01        5.93E-01        7.71E-02        1.44E-01        -1.42E+00
test97  5.05E-01        4.98E-01        4.97E-01        9.80E-02        1.44E-01        -1.44E+00
test98  4.92E-01        8.74E-01        8.74E-01        -2.29E-02       1.44E-01        -1.37E+00
test99  4.89E-01        2.31E-01        2.31E-01        -1.73E-01       1.45E-01        -1.30E+00
104 loaded variants
4 filtered variants
100 tested variants
100 printed variants

Could it be something system specific?

M

SumeetTiwari07 commented 6 years ago

Oh, OK. I will copy my whole gene_presence_absence data into another file, from the server to my desktop, and try again, because I have sometimes seen the text encoding of a file change automatically. But I will try it here first and let you all know. Thank you Marchow

SumeetTiwari07 commented 6 years ago

I think the problem is my gene presence/absence table (.Rtab): when I tried a subset of rows (about 70000 out of roughly 78000), it worked fine on the same system, but when I tried the whole gene_presence_absence.Rtab file, it showed the error. I have to look inside the table now.

mgalardini commented 6 years ago

Hi,

maybe you could try to see which is the last gene before the error (using --cpu 1) and then inspect the rows immediately before that in your gene presence/absence file. Happy to have a look as well, if you feel like sharing your full file.
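One way to narrow down the offending row without running the full analysis is to compare the number of fields on every row with the number in the header. The sketch below is an illustration only: `find_mismatched_rows` is a hypothetical helper, not a pyseer function, and splitting on any whitespace by default is an assumption about how a row could end up with an extra column. Pass `sep="\t"` to split on tabs only.

```python
def find_mismatched_rows(lines, sep=None):
    """Yield (line_number, row_label, n_fields) for rows whose number of
    fields differs from the header's.

    With sep=None, rows are split on any run of whitespace; pass a tab
    character as sep to split on tabs only.
    """
    expected = None
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(sep)
        if expected is None:
            # The first line is the header and defines the expected width.
            expected = len(fields)
            continue
        if len(fields) != expected:
            label = fields[0] if fields else ""
            yield number, label, len(fields)


if __name__ == "__main__":
    # Made-up example table with one row label that contains a space.
    rtab = [
        "Gene\ts1\ts2\ts3\n",
        "geneA\t1\t0\t1\n",
        "gene B\t0\t1\t1\n",
    ]
    for number, label, n_fields in find_mismatched_rows(rtab):
        print(f"line {number}: {n_fields} fields, starts with {label!r}")
```

Because the function only iterates over lines, an open file handle can be passed in place of the list, so a large table does not have to be loaded into memory at once.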

Best, M

SumeetTiwari07 commented 6 years ago

Hi, yes, I did that and found that at row 70333 of the .Rtab file the gene name contains a space ("hdl IVa"). The program reads it as an extra column. I think the script splits columns on whitespace rather than on tabs only, which is why every row has the same length except that one. After replacing the space with an underscore it worked, and I got the result.

Read 1199 phenotypes
Detected binary phenotype
78718 loaded variants
56938 filtered variants
21780 tested variants
21711 printed variants
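The fix described above (replacing the space in the gene name with an underscore) could be applied to every row of the table along these lines. This is a minimal sketch under the assumption that the table is tab-separated with the gene name in the first field; `sanitize_row_label` is a hypothetical helper and not part of pyseer or Roary.

```python
import re


def sanitize_row_label(line):
    """Replace runs of whitespace inside the first tab-separated field with '_'.

    Lines without a tab are returned unchanged.
    """
    label, tab, rest = line.partition("\t")
    if not tab:
        return line
    return re.sub(r"\s+", "_", label.strip()) + tab + rest


if __name__ == "__main__":
    # Made-up rows; only the gene name "hdl IVa" comes from the thread.
    for row in ["Gene\ts1\ts2\n", "geneA\t1\t0\n", "hdl IVa\t0\t1\n"]:
        print(sanitize_row_label(row), end="")
```

Only the first field is touched, so the presence/absence values and the sample names in the header are left as they are.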

Thank you both, M and John Lees, so much for your active help.

Regards, Sumeet

mgalardini commented 6 years ago

Oh I see! Well spotted, I'll fix this so that spaces are tolerated (though I would not recommend it)

M
