ocr cleaner has bug with gcc library / scikit image version

vsoch commented 5 years ago

Unfortunately, the container's entire set of libraries / base image needs to be debugged.

>>> maybe_text = dicom.select_text_among_candidates(saved_model)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "user/__init__.py", line 122, in select_text_among_candidates
    model = cPickle.load(fin)
  File "data/__init__.py", line 29, in <module>
    from sklearn.svm import LinearSVC
  File "/opt/anaconda2/lib/python2.7/site-packages/sklearn/svm/__init__.py", line 13, in <module>
    from .classes import SVC, NuSVC, SVR, NuSVR, OneClassSVM, LinearSVC
  File "/opt/anaconda2/lib/python2.7/site-packages/sklearn/svm/classes.py", line 1, in <module>
    from .base import BaseLibLinear, BaseSVC, BaseLibSVM
  File "/opt/anaconda2/lib/python2.7/site-packages/sklearn/svm/base.py", line 8, in <module>
    from . import libsvm, liblinear
ImportError: /opt/anaconda2/lib/python2.7/site-packages/sklearn/svm/libsvm.so: undefined symbol: __cxa_throw_bad_array_new_length

See notes in #8

danielsnider commented 5 years ago

I've seen two people say to run: conda install libgcc
[1] https://github.com/scikit-learn/scikit-learn/issues/7869#issuecomment-261098057
[2] https://stackoverflow.com/questions/42181453/sklearn-modules-on-ubuntu-oracle-virtual-box-throw-error

Would you have time to try it?
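
For context: as far as I know, __cxa_throw_bad_array_new_length is a C++ runtime symbol that only newer libstdc++ builds (GCC 4.9 and later) provide, which would explain why installing a newer libgcc is the usual suggestion. Below is a minimal sketch for checking whether a shared library exports that symbol. It assumes Linux, and it assumes that "libstdc++.so.6" resolves to the same copy the failing Python process would load, which may not hold inside a conda environment. The helper name library_exports is made up for illustration.

```python
import ctypes


def library_exports(lib_name, symbol):
    """Return True or False depending on whether lib_name exports symbol.

    Returns None when the library itself cannot be loaded.
    """
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        return None
    try:
        lib[symbol]  # symbol lookup; raises AttributeError when it is missing
    except AttributeError:
        return False
    return True


if __name__ == "__main__":
    # Assumption: libstdc++.so.6 is the soname of the C++ runtime on this system.
    print(library_exports("libstdc++.so.6", "__cxa_throw_bad_array_new_length"))
```

If this prints False inside the container, the libstdc++ that gets picked up is probably older than what libsvm.so was compiled against.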

vsoch commented 5 years ago

yep!

vsoch commented 5 years ago

Lord, I hope the fix is that easy. An image that doesn't reproduce when you build it again is my worst nightmare.

vsoch commented 5 years ago

It could also help to try installing scikit-learn from conda instead of pip. But I have a terrible feeling there is going to be some new conflict with nolearn (I can't remember off the top of my head why I stayed with python 2.7 in the first place, but it was some dependency issue).

vsoch commented 5 years ago

Okay, here is an update! The first error was with libgfortran:

  File "<string>", line 1, in <module>
  File "/opt/anaconda2/lib/python2.7/site-packages/numpy/__init__.py", line 170, in <module>
    from . import add_newdocs
  File "/opt/anaconda2/lib/python2.7/site-packages/numpy/add_newdocs.py", line 13, in <module>
    from numpy.lib import add_newdoc
  File "/opt/anaconda2/lib/python2.7/site-packages/numpy/lib/__init__.py", line 18, in <module>
    from .polynomial import *
  File "/opt/anaconda2/lib/python2.7/site-packages/numpy/lib/polynomial.py", line 19, in <module>
    from numpy.linalg import eigvals, lstsq, inv
  File "/opt/anaconda2/lib/python2.7/site-packages/numpy/linalg/__init__.py", line 51, in <module>
    from .linalg import *
  File "/opt/anaconda2/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 29, in <module>
    from numpy.linalg import lapack_lite, _umath_linalg
ImportError: libgfortran.so.1: cannot open shared object file: No such file or directory

I resolved with:

conda install libgfortran==1

(If you install without pinning the version, you get another error.) Then I get this error about numpy versions:

/opt/anaconda2/lib/python2.7/site-packages/dask/array/numpy_compat.py:32: RuntimeWarning: divide by zero encountered in divide
  not np.allclose(np.divide(1, .5, dtype='i8'), 2) or
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "user/__init__.py", line 2, in <module>
    from skimage.io import imread
  File "/opt/anaconda2/lib/python2.7/site-packages/skimage/io/__init__.py", line 7, in <module>
    from .manage_plugins import *
  File "/opt/anaconda2/lib/python2.7/site-packages/skimage/io/manage_plugins.py", line 28, in <module>
    from .collection import imread_collection_wrapper
  File "/opt/anaconda2/lib/python2.7/site-packages/skimage/io/collection.py", line 14, in <module>
    from ..external.tifffile import TiffFile
  File "/opt/anaconda2/lib/python2.7/site-packages/skimage/external/tifffile/__init__.py", line 1, in <module>
    from .tifffile import imsave, imread, imshow, TiffFile, TiffWriter, TiffSequence
  File "/opt/anaconda2/lib/python2.7/site-packages/skimage/external/tifffile/tifffile.py", line 293, in <module>
    from . import _tifffile
RuntimeError: module compiled against API version 0xa but this version of numpy is 0x9

And I'm still trying random numpy versions (from repos where it's reported to work) to see if it resolves.
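
When cycling through package versions like this, it can help to see the whole import chain at a glance after each rebuild. Here is a rough sketch, not something from the repository: it tries each import and records either the top-level package version or the error. The helper name check_imports and the module list are assumptions taken from the tracebacks above. It catches Exception rather than only ImportError because the numpy API mismatch above surfaces as a RuntimeError.

```python
import importlib
import sys


def check_imports(module_names):
    """Try importing each module; map its name to a version or an error string."""
    report = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception as exc:  # ImportError, RuntimeError from ABI mismatch, ...
            report[name] = "FAILED: {}: {}".format(type(exc).__name__, exc)
        else:
            # Report the version of the top-level package, e.g. sklearn for sklearn.svm.
            top = sys.modules.get(name.split(".")[0], module)
            report[name] = "ok, version {}".format(getattr(top, "__version__", "unknown"))
    return report


if __name__ == "__main__":
    # Modules taken from the tracebacks in this thread.
    for name, status in check_imports(["numpy", "sklearn.svm", "skimage.io"]).items():
        print("{:12s} {}".format(name, status))
```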

vsoch commented 5 years ago

It's been resolving the conda environment for easily 5 minutes now. :/

vsoch commented 5 years ago

Is it worth trying to update the entire thing to python 3+, or is that a forest path I don't want to venture down?

danielsnider commented 5 years ago

Ack! Thank you for fighting the good fight. I wish dependency hell was a thing of the past. Need smarter python. Daniel Snider ツ

danielsnider commented 5 years ago

It's a scary forest. My recent adventure down that path may help you a lot. I recently got pydicom and gdcm working in py3. Here's how: https://github.com/pydicom/pydicom/issues/331#issuecomment-450585088
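
One small piece of a py3 port, sketched under assumptions: the first traceback shows the model being read with cPickle.load, and in Python 3 that module is simply pickle. Pickles written by Python 2 that contain 8-bit strings or numpy arrays often need encoding="latin1" to load under Python 3. The helper name load_legacy_pickle is hypothetical, and this only addresses the byte/str decoding step. It does nothing about a model pickled against an incompatible sklearn.

```python
import pickle


def load_legacy_pickle(path):
    """Load a pickle, retrying with latin1 decoding for files written by Python 2.

    Python 3 only: the encoding keyword does not exist in Python 2.
    """
    with open(path, "rb") as fin:
        try:
            return pickle.load(fin)
        except UnicodeDecodeError:
            fin.seek(0)  # rewind before the second attempt
            return pickle.load(fin, encoding="latin1")
```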

vsoch commented 5 years ago

Thanks, this might help! The issue is with scikit-learn, but maybe a global update can still resolve it...

vsoch commented 5 years ago

Okay, so this won't work unless the model is rebuilt from scratch. It was built with an older sklearn; specifically, even if I can get the pickle to load, the classes_ attribute is missing:

AttributeError: 'LinearSVC' object has no attribute 'classes_'

This would require downloading the entire CIFAR dataset and doing it over. Did you test the original image, and does it not work for you? -> https://hub.docker.com/r/vanessa/dicom-scraper

It's dangerous to use this as a base, but we could potentially do that and install gdcm to read your images. It of course is a (long term) bad idea because we will forever be stuck with that python version, etc., but if you want a quick way to run it that might be easiest.
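
If the model does get rebuilt, one way to make this kind of breakage easier to spot next time is to store the library versions next to the pickled model and check them at load time. This is a minimal sketch under assumptions, not how the repository currently saves its model. The function names, the bundle layout, and the required_attrs check are all made up for illustration. It would not rescue the existing pickle, which was saved without any version metadata.

```python
import pickle


def dump_with_versions(model, path, versions):
    """Pickle the model together with a dict of the library versions used to build it."""
    with open(path, "wb") as fout:
        pickle.dump({"versions": dict(versions), "model": model}, fout, protocol=2)


def load_checked(path, current_versions, required_attrs=()):
    """Load a bundle written by dump_with_versions and list any problems found."""
    with open(path, "rb") as fin:
        bundle = pickle.load(fin)
    problems = []
    # Compare the versions recorded at build time with the ones running now.
    for lib, built_with in bundle["versions"].items():
        found = current_versions.get(lib)
        if found != built_with:
            problems.append("{}: built with {}, running {}".format(lib, built_with, found))
    # Check for attributes the calling code relies on, e.g. classes_.
    for attr in required_attrs:
        if not hasattr(bundle["model"], attr):
            problems.append("model is missing attribute {!r}".format(attr))
    return bundle["model"], problems
```

In use, versions would be something like {"sklearn": sklearn.__version__} at both save and load time, with required_attrs=["classes_"] to catch the exact failure above before the model is used.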

danielsnider commented 5 years ago

That’s sad. Sorry about that. I appreciate your smart, pragmatic advice. The original docker image for the OCR scraper didn’t like my compressed dicom images. If you can share any results showing how well the ocr scraper works that would help me consider the options. We could trade notes later next week!

Thank you again,

vsoch commented 5 years ago

Hey I haven't lost hope - there are still two things to try!

1. using the original as a base image and installing gdcm
2. rebuilding the model

I'll try both this weekend and post an update. It would be really cool to be able to do that comparison! :)

vsoch commented 5 years ago

hey @danielsnider this isn't going to easily work unfortunately, and even rebuilding the model would require substantial refactoring that would probably require a full time effort (I do this in my free time, mostly for fun). You can likely use the old image if you can find non-gdcm images, but it's probably not worth it.

I'm generally unhappy and disappointed with this work, and wish I could allocate the time to do it over - it was literally a small weekend project I did and then nobody needed it, so I didn't work on it further. Do you think it's worth trying to plug in some newer / better OCR implementation and update the image so you have something to test against?

danielsnider commented 5 years ago

I'm generally disappointed with python dependencies! No worries tho. I've got a presentation Monday so I have to stick to my OCR implementation at the moment. I'll let you know how it goes and I'll be very happy to share it nicely.

Daniel Snider ツ

NJ2020 commented 5 years ago

Thanks. How do we do the same thing (getting pydicom and gdcm working in py3) on Windows 10?

pydicom / dicom-cleaner

ocr cleaner has bug with gcc library / scikit image version #9