huggingface / neuralcoref

✨Fast Coreference Resolution in spaCy with Neural Networks
https://huggingface.co/coref/

Can't serialize document #82

Open minhlab opened 6 years ago

minhlab commented 6 years ago

I can save a spaCy document to disk, but not one produced by neuralcoref. For example, the following snippet fails with TypeError: can't serialize My sister: [My sister, She].

import spacy

# Plain spaCy model: serializes fine.
nlp0 = spacy.load('en_core_web_sm')
doc0 = nlp0(u'My sister has a dog. She loves him.')
with open('output/test0.pkl', 'wb') as f:
    f.write(doc0.to_bytes())

# neuralcoref model: to_bytes() raises the TypeError above.
nlp = spacy.load('en_coref_sm')
doc = nlp(u'My sister has a dog. She loves him.')
with open('output/test.pkl', 'wb') as f:
    f.write(doc.to_bytes())

The files produced are as follows:

$ ls -lh output/*.pkl
-rw-r--r--  1 cumeo  staff     0B Aug 11 21:16 output/test.pkl
-rw-r--r--  1 cumeo  staff    16K Aug 11 21:16 output/test0.pkl
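For context, the failure seems to come from where the coref output is stored rather than from the Doc itself: spaCy keeps custom extension values in doc.user_data, and Doc.to_bytes() serializes user_data along with the rest of the Doc. A minimal diagnostic sketch (not a fix, and assuming the annotations do live in user_data):

import spacy

nlp = spacy.load('en_coref_sm')
doc = nlp(u'My sister has a dog. She loves him.')

# Inspect what to_bytes() has to serialize: expect entries whose values are
# neuralcoref Cluster / Span objects, which msgpack cannot handle.
for key, value in doc.user_data.items():
    print(key, type(value))

# doc0 from the plain en_core_web_sm pipeline has an empty user_data,
# which is why it serializes without trouble.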
hilaw commented 5 years ago

I'm having the same issue.

thomwolf commented 5 years ago

This is still an issue in the current version (4.0); I will try to fix it.

Vimos commented 5 years ago

Similar issue with directly pickling the doc.

In [1]: import spacy                                                                                  

In [2]: import neuralcoref                                                                            

In [3]: nlp = spacy.load('en_core_web_sm')                                                            

In [4]: neuralcoref.add_to_pipe(nlp)                                                                  
Out[4]: <spacy.lang.en.English at 0x7f8ef8f17dd8>

In [5]: d = nlp("NeuralCoref is a pipeline extension for spaCy 2.1+ which annotates and resolves coref
   ...: erence clusters using a neural network. NeuralCoref is production-ready, integrated in spaCy's
   ...:  NLP pipeline and extensible to new training datasets.")                                      

In [6]: import pickle                                                                                 

In [7]: with open('test.pt', 'wb') as f: 
   ...:     pickle.dump(d, f) 
   ...:                                                                                               
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-7-569cbf65bb25> in <module>
      1 with open('test.pt', 'wb') as f:
----> 2     pickle.dump(d, f)
      3 

doc.pyx in spacy.tokens.doc.pickle_doc()

~/anaconda3/envs/rcqa/lib/python3.7/site-packages/srsly/_pickle_api.py in pickle_dumps(data, protocol)
     12     RETURNS (bytest): The serialized object.
     13     """
---> 14     return cloudpickle.dumps(data, protocol=protocol)
     15 
     16 

~/anaconda3/envs/rcqa/lib/python3.7/site-packages/srsly/cloudpickle/cloudpickle.py in dumps(obj, protocol)
    952     try:
    953         cp = CloudPickler(file, protocol=protocol)
--> 954         cp.dump(obj)
    955         return file.getvalue()
    956     finally:

~/anaconda3/envs/rcqa/lib/python3.7/site-packages/srsly/cloudpickle/cloudpickle.py in dump(self, obj)
    282         self.inject_addons()
    283         try:
--> 284             return Pickler.dump(self, obj)
    285         except RuntimeError as e:
    286             if 'recursion' in e.args[0]:

~/anaconda3/envs/rcqa/lib/python3.7/pickle.py in dump(self, obj)
    435         if self.proto >= 4:
    436             self.framer.start_framing()
--> 437         self.save(obj)
    438         self.write(STOP)
    439         self.framer.end_framing()

~/anaconda3/envs/rcqa/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
    502         f = self.dispatch.get(t)
    503         if f is not None:
--> 504             f(self, obj) # Call unbound method with explicit self
    505             return
    506 

~/anaconda3/envs/rcqa/lib/python3.7/pickle.py in save_tuple(self, obj)
    784         write(MARK)
    785         for element in obj:
--> 786             save(element)
    787 
    788         if id(obj) in memo:

~/anaconda3/envs/rcqa/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
    502         f = self.dispatch.get(t)
    503         if f is not None:
--> 504             f(self, obj) # Call unbound method with explicit self
    505             return
    506 

~/anaconda3/envs/rcqa/lib/python3.7/pickle.py in save_dict(self, obj)
    854 
    855         self.memoize(obj)
--> 856         self._batch_setitems(obj.items())
    857 
    858     dispatch[dict] = save_dict

~/anaconda3/envs/rcqa/lib/python3.7/pickle.py in _batch_setitems(self, items)
    880                 for k, v in tmp:
    881                     save(k)
--> 882                     save(v)
    883                 write(SETITEMS)
    884             elif n:

~/anaconda3/envs/rcqa/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
    502         f = self.dispatch.get(t)
    503         if f is not None:
--> 504             f(self, obj) # Call unbound method with explicit self
    505             return
    506 

~/anaconda3/envs/rcqa/lib/python3.7/pickle.py in save_list(self, obj)
    814 
    815         self.memoize(obj)
--> 816         self._batch_appends(obj)
    817 
    818     dispatch[list] = save_list

~/anaconda3/envs/rcqa/lib/python3.7/pickle.py in _batch_appends(self, items)
    841                 write(APPENDS)
    842             elif n:
--> 843                 save(tmp[0])
    844                 write(APPEND)
    845             # else tmp is empty, and we're done

~/anaconda3/envs/rcqa/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
    547 
    548         # Save the reduce() output and finally memoize the object
--> 549         self.save_reduce(obj=obj, *rv)
    550 
    551     def persistent_id(self, obj):

~/anaconda3/envs/rcqa/lib/python3.7/pickle.py in save_reduce(self, func, args, state, listitems, dictitems, obj)
    660 
    661         if state is not None:
--> 662             save(state)
    663             write(BUILD)
    664 

~/anaconda3/envs/rcqa/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
    502         f = self.dispatch.get(t)
    503         if f is not None:
--> 504             f(self, obj) # Call unbound method with explicit self
    505             return
    506 

~/anaconda3/envs/rcqa/lib/python3.7/pickle.py in save_dict(self, obj)
    854 
    855         self.memoize(obj)
--> 856         self._batch_setitems(obj.items())
    857 
    858     dispatch[dict] = save_dict

~/anaconda3/envs/rcqa/lib/python3.7/pickle.py in _batch_setitems(self, items)
    880                 for k, v in tmp:
    881                     save(k)
--> 882                     save(v)
    883                 write(SETITEMS)
    884             elif n:

~/anaconda3/envs/rcqa/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
    522             reduce = getattr(obj, "__reduce_ex__", None)
    523             if reduce is not None:
--> 524                 rv = reduce(self.proto)
    525             else:
    526                 reduce = getattr(obj, "__reduce__", None)

span.pyx in spacy.tokens.span.Span.__reduce__()

NotImplementedError: [E112] Pickling a span is not supported, because spans are only views of the parent Doc and can't exist on their own. A pickled span would always have to include its Doc and Vocab, which has practically no advantage over pickling the parent Doc directly. So instead of pickling the span, pickle the Doc it belongs to or use Span.as_doc to convert the span to a standalone Doc object.
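The error message itself points at a per-mention workaround: don't pickle the Span, either pickle the parent Doc or convert the mention with Span.as_doc(), or reduce it to plain text and offsets. A minimal sketch along those lines, assuming the extension names documented in the neuralcoref README (has_coref, coref_clusters) and that only a single mention needs to be saved:

import pickle
import spacy
import neuralcoref

nlp = spacy.load('en_core_web_sm')
neuralcoref.add_to_pipe(nlp)
doc = nlp(u'My sister has a dog. She loves him.')

if doc._.has_coref:
    cluster = doc._.coref_clusters[0]
    main_mention = cluster.main                 # a spaCy Span, not picklable on its own

    # Option 1: turn the Span into a standalone Doc, as the error message suggests.
    mention_doc = main_mention.as_doc()

    # Option 2: keep only plain data about the mention.
    mention_info = (main_mention.text, main_mention.start_char, main_mention.end_char)

    with open('mention.pkl', 'wb') as f:
        pickle.dump(mention_info, f)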
kjaquier commented 5 years ago

Similar issue. Python 3.6.8 (64-bit) on Anaconda, Windows 10.

>> neuralcoref.__version__
'4.0.0'
>> nlp = spacy.load('en_core_web_sm')
>> coref = neuralcoref.NeuralCoref(nlp.vocab)
>> nlp.add_pipe(coref, name='neuralcoref')
>> doc = nlp('My sister has a dog. She loves him.')
>> doc.to_disk('test.pkl')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
      3 nlp.add_pipe(coref, name='neuralcoref')
      4 doc = nlp('My sister has a dog. She loves him.')
----> 5 doc.to_disk('test.pkl')

doc.pyx in spacy.tokens.doc.Doc.to_disk()

doc.pyx in spacy.tokens.doc.Doc.to_disk()

doc.pyx in spacy.tokens.doc.Doc.to_bytes()

~\AppData\Roaming\Python\Python36\site-packages\spacy\util.py in to_bytes(getters, exclude)
    580         # Split to support file names like meta.json
    581         if key.split(".")[0] not in exclude:
--> 582             serialized[key] = getter()
    583     return srsly.msgpack_dumps(serialized)
    584 

doc.pyx in spacy.tokens.doc.Doc.to_bytes.lambda8()

~\AppData\Roaming\Python\Python36\site-packages\srsly\_msgpack_api.py in msgpack_dumps(data)
     14     RETURNS (bytes): The serialized bytes.
     15     """
---> 16     return msgpack.dumps(data, use_bin_type=True)
     17 
     18 

~\AppData\Roaming\Python\Python36\site-packages\srsly\msgpack\__init__.py in packb(o, **kwargs)
     38     Pack an object and return the packed bytes.
     39     """
---> 40     return Packer(**kwargs).pack(o)
     41 
     42 

_packer.pyx in srsly.msgpack._packer.Packer.pack()

_packer.pyx in srsly.msgpack._packer.Packer.pack()

_packer.pyx in srsly.msgpack._packer.Packer.pack()

_packer.pyx in srsly.msgpack._packer.Packer._pack()

_packer.pyx in srsly.msgpack._packer.Packer._pack()

_packer.pyx in srsly.msgpack._packer.Packer._pack()

TypeError: can not serialize 'Cluster' object
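If the coref annotations aren't needed in the serialized copy, spaCy 2.1+ lets you exclude user_data (where the 'Cluster' objects end up) when serializing. A sketch of that workaround, not a fix for the underlying issue:

import spacy
import neuralcoref
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')
neuralcoref.add_to_pipe(nlp)
doc = nlp('My sister has a dog. She loves him.')

# Skipping user_data leaves out the Cluster/Span objects, so serialization
# succeeds, at the cost of losing the coref annotations.
data = doc.to_bytes(exclude=['user_data'])
doc2 = Doc(nlp.vocab).from_bytes(data)

# doc.to_disk('test.spacy', exclude=['user_data']) should behave the same way.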
stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

fhamborg commented 5 years ago

I also have this issue on the most recent neuralcoref version.

WillBoan commented 5 years ago

I've had this issue too, while trying to call doc_bytes = doc.to_bytes()

davidbernat commented 5 years ago

I also have this issue. Error: TypeError: can not serialize 'spacy.tokens.span.Span' object

svlandeg commented 4 years ago

Thanks for the reports - will look into this!

AlJohri commented 4 years ago

I ran into the same issue when running nlp.pipe with multiple processes (presumably because each worker has to serialize the Docs to send them back to the parent process, which hits the same error):

for doc in nlp.pipe(df.text, batch_size=5, n_process=4):
    print(doc)
AlJohri commented 4 years ago

Since this is actively blocking me, I found a temporary workaround:

def remove_unserializable_results(doc):
    # Drop everything neuralcoref stored on the Doc and its Tokens so the Doc
    # can be serialized again (the coref annotations are lost in the process).
    doc.user_data = {}
    for x in dir(doc._):
        if x in ['get', 'set', 'has']: continue
        setattr(doc._, x, None)
    for token in doc:
        for x in dir(token._):
            if x in ['get', 'set', 'has']: continue
            setattr(token._, x, None)
    return doc

nlp.add_pipe(remove_unserializable_results, last=True)

I added this after my last pipeline component (i.e. after='coreference_resolver'), which had already converted the coreferences into entities, so I no longer needed the unserializable coref metadata.

petulla commented 4 years ago

Same issue in Databricks with PySpark.

dpasch01 commented 4 years ago
doc.user_data = {}

Can you please provide a more complete example? I used your code snippet, but unfortunately I then have no access to the coref data.

AltfunsMA commented 3 years ago

@dpasch01, the following worked for me in terms of saving at least the string representation of neuralcoref output.

def remove_unserializable_results(doc):
    # Keep the resolved text as a plain string before clearing everything else.
    temp = str(doc._.coref_resolved)
    doc.user_data = {"coref": temp}
    for x in dir(doc._):
        getattr(doc._, x)
    for x in dir(doc._):
        if x in ['get', 'set', 'has', 'coref_as_ner']: continue
        setattr(doc._, x, None)
    for token in doc:
        for x in dir(token._):
            if x in ['get', 'set', 'has', 'coref_as_ner']: continue
            setattr(token._, x, None)
    return doc

nlp.add_pipe(remove_unserializable_results, last=True)

Then you can do the usual docs = nlp.pipe(my_list_of_texts) and get that string with [doc.user_data['coref'] for doc in docs].
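A possible variation on the snippet above, sketched here under the assumption that the extension names (has_coref, coref_clusters, coref_resolved) and the Cluster attributes (main, mentions) match the neuralcoref README: besides the resolved string, keep the clusters themselves as plain Python values so they survive serialization.

import spacy
import neuralcoref

nlp = spacy.load('en_core_web_sm')
neuralcoref.add_to_pipe(nlp)

def keep_coref_as_plain_data(doc):
    clusters = []
    if doc._.has_coref:
        for cluster in doc._.coref_clusters:
            clusters.append({
                'main': cluster.main.text,
                'mentions': [(m.text, m.start_char, m.end_char) for m in cluster.mentions],
            })
    resolved = doc._.coref_resolved if doc._.has_coref else doc.text
    # Custom extension values live in doc.user_data, so replacing it with plain
    # Python values removes the Cluster/Span objects that block serialization.
    doc.user_data = {'coref_resolved': resolved, 'coref_clusters': clusters}
    return doc

nlp.add_pipe(keep_coref_as_plain_data, last=True)

docs = list(nlp.pipe(['My sister has a dog. She loves him.']))
print(docs[0].user_data['coref_resolved'])
print(docs[0].user_data['coref_clusters'])

After nlp.pipe(...), each doc then carries only ordinary strings, ints and lists in user_data, so both pickle and to_bytes should go through.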

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.