What's the word limit for the model? #18

ReichYang commented 1 year ago

Hi, I'm trying to parse some texts that are pretty long, and I run into this error:

AssertionError                            Traceback (most recent call last)
Cell In[47], line 1
----> 1 restored_text=df.loc[df['unpunc'] == True, 0].map(model.restore_punctuation)

File /work/reddit-policomp/miniconda3/envs/my-py310env/lib/python3.10/site-packages/pandas/core/series.py:4539, in Series.map(self, arg, na_action)
   4460 def map(
   4461     self,
   4462     arg: Callable | Mapping | Series,
   4463     na_action: Literal["ignore"] | None = None,
   4464 ) -> Series:
   4465     """
   4466     Map values of Series according to an input mapping or function.
   4537     dtype: object
   4538     """
-> 4539     new_values = self._map_values(arg, na_action=na_action)
   4540     return self._constructor(new_values, index=self.index).__finalize__(
   4541         self, method="map"
   4542     )

File /work/reddit-policomp/miniconda3/envs/my-py310env/lib/python3.10/site-packages/pandas/core/base.py:890, in IndexOpsMixin._map_values(self, mapper, na_action)
    887         raise ValueError(msg)
    889 # mapper is a function
--> 890 new_values = map_f(values, mapper)
    892 return new_values

File /work/reddit-policomp/miniconda3/envs/my-py310env/lib/python3.10/site-packages/pandas/_libs/lib.pyx:2924, in pandas._libs.lib.map_infer()

File /work/reddit-policomp/miniconda3/envs/my-py310env/lib/python3.10/site-packages/deepmultilingualpunctuation/punctuationmodel.py:21, in PunctuationModel.restore_punctuation(self, text)
     20 def restore_punctuation(self,text):        
---> 21     result = self.predict(self.preprocess(text))
     22     return self.prediction_to_text(result)

File /work/reddit-policomp/miniconda3/envs/my-py310env/lib/python3.10/site-packages/deepmultilingualpunctuation/punctuationmodel.py:49, in PunctuationModel.predict(self, words)
     47 text = " ".join(batch)
     48 result = self.pipe(text)      
---> 49 assert len(text) == result[-1]["end"], "chunk size too large, text got clipped"
     51 char_index = 0
     52 result_index = 0

AssertionError: chunk size too large, text got clipped

I didn't use any other config, just the default model and predict function. It looks like either the text is too long or the chunk_size is too large (I didn't configure it). Is there anything I should do to make it work properly?

oliverguhr commented 1 year ago

Hi @ReichYang, the model can only handle 512 tokens. Since one word is not always one token, the code splits the input text on whitespace and uses 230 elements (i.e., words) for each inference step. This way we can iterate over long texts.

However, if your input contains many uncommon words, such as names of chemical compounds, then even 230 words can translate into more than 512 tokens.
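To make the word/token mismatch concrete, here is a minimal, simplified sketch of the chunking idea described above. It is not the library's actual code: `split_into_chunks`, `oversized_chunks`, and `toy_count_tokens` are hypothetical names, and the toy counter only imitates how a subword tokenizer breaks long, rare words into several tokens. In practice you would pass in a function that counts tokens with the model's real tokenizer.

```python
from typing import Callable, List


def split_into_chunks(text: str, chunk_size: int = 230) -> List[List[str]]:
    """Split text on whitespace and group the words into chunks of at most chunk_size words."""
    words = text.split()
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]


def oversized_chunks(
    chunks: List[List[str]],
    count_tokens: Callable[[str], int],
    token_limit: int = 512,
) -> List[int]:
    """Return the indices of chunks whose token count exceeds token_limit."""
    return [
        i for i, chunk in enumerate(chunks)
        if count_tokens(" ".join(chunk)) > token_limit
    ]


def toy_count_tokens(text: str) -> int:
    """Toy stand-in for a real tokenizer (assumption): one token per started run of 4 characters."""
    return sum((len(word) + 3) // 4 for word in text.split())


# 230 short, common words stay under the limit ...
common = " ".join(["the"] * 230)
# ... while 230 long, rare words blow past 512 tokens.
rare = " ".join(["methylenedioxymethamphetamine"] * 230)

print(oversized_chunks(split_into_chunks(common), toy_count_tokens))
print(oversized_chunks(split_into_chunks(rare), toy_count_tokens))
```

Running a check like this over the failing rows, with a real token counter swapped in, should point to the chunks that get clipped.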

Can you paste a sample of your input data that leads to this issue?

ReichYang commented 1 year ago

Thanks for the reply, that makes sense. My data consists of podcast transcripts produced by AI tools. This is the specific text that triggered the problem:

oliverguhr commented 10 months ago

Take a look at the new parameter introduced by the patch above and try reducing the chunk_size. That should help.
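One way to apply this without guessing a value up front is to retry with progressively smaller chunks whenever the "text got clipped" assertion fires. The sketch below is only an illustration under stated assumptions: `restore_with_smaller_chunks` is a hypothetical helper, and `restore` is a callable you supply that accepts `(text, chunk_size)`. How chunk_size is actually passed to the library depends on the patched version, which this thread does not spell out.

```python
from typing import Callable, List


def restore_with_smaller_chunks(
    restore: Callable[[str, int], str],
    text: str,
    chunk_size: int = 230,
    min_chunk_size: int = 10,
) -> str:
    """Call restore(text, chunk_size), halving chunk_size after each AssertionError."""
    while True:
        try:
            return restore(text, chunk_size)
        except AssertionError:
            if chunk_size <= min_chunk_size:
                raise  # even the smallest allowed chunk failed, so give up
            chunk_size = max(min_chunk_size, chunk_size // 2)


# Stand-in for the real model call (assumption, for demonstration only):
# it fails the same way the library does whenever chunks exceed 60 words.
attempted_sizes: List[int] = []


def fake_restore(text: str, chunk_size: int) -> str:
    attempted_sizes.append(chunk_size)
    assert chunk_size <= 60, "chunk size too large, text got clipped"
    return text.capitalize() + "."


result = restore_with_smaller_chunks(fake_restore, "hello world")
print(result, attempted_sizes)
```

With the real model, `restore` would be a small wrapper around the punctuation call that forwards chunk_size wherever the patched version expects it. Smaller chunks mean more inference steps, so this trades some speed for robustness on rows with unusual vocabulary.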