humlab / penelope

Pipeline for generating data used in text analytics notebooks. Used by Welfare State Analytics, INIDUN and several other research projects.

vectorize raises IndexError if TF threshold > 1 #173

Open roger-mahler opened 1 year ago

roger-mahler commented 1 year ago

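Check the compress step of the vocabulary! With --tf-threshold 1 no compress step is logged and vectorize completes. With --tf-threshold 10 the vocabulary is compressed and vectorize fails with an IndexError (column index out of bounds). In the failing run the document-term matrix has 28358 columns, but the token ids for document 22 go up to 28482.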
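One thing worth checking is whether the token ids in the stream are translated to the new ids after the vocabulary has been compressed. The sketch below is not penelope's implementation; `compress_vocab`, `remap` and the toy data are hypothetical names and values made up for illustration. It only shows the invariant a compress step has to preserve: after compression, every token id that reaches the matrix must be smaller than the size of the compressed vocabulary.

```python
LOW_TF = "__low-tf__"  # marker name as printed in the compress log message


def compress_vocab(token2id, tf, threshold, low_tf_token=LOW_TF):
    """Hypothetical helper, not penelope's implementation.

    token2id: dict mapping token -> old id
    tf:       dict mapping old id -> term frequency
    Returns (new_token2id, old_to_new). Every old id gets a valid new id,
    and tokens below the threshold collapse onto the low-TF marker.
    """
    new_token2id = {low_tf_token: 0}
    # Walk tokens in old-id order so the new ids are deterministic.
    for token in sorted(token2id, key=token2id.get):
        if token != low_tf_token and tf.get(token2id[token], 0) >= threshold:
            new_token2id[token] = len(new_token2id)
    old_to_new = {
        old_id: new_token2id.get(token, new_token2id[low_tf_token])
        for token, old_id in token2id.items()
    }
    return new_token2id, old_to_new


def remap(token_ids, old_to_new):
    """Translate a document's old token ids to ids in the compressed vocab."""
    return [old_to_new[token_id] for token_id in token_ids]


# Toy data (invented for this example).
token2id = {"the": 0, "unesco": 1, "rare": 2, "courier": 3, "hapax": 4}
tf = {0: 50, 1: 12, 2: 3, 3: 10, 4: 1}

new_token2id, old_to_new = compress_vocab(token2id, tf, threshold=10)
remapped = remap([0, 4, 3, 2, 1], old_to_new)

# The invariant: every remapped id is a valid column index.
assert max(remapped) < len(new_token2id)
```

If the matrix width is taken from the compressed vocabulary while the stream still carries old ids, the last assertion is the one that would fail.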

┌─[roger@vulcan]─[~/source/penelope]  (dev *$>)
└──>  (humlab-penelope-ATiWKElI) λ PYTHONPATH=. python ./penelope/scripts/dtm/vectorize.py --tf-threshold 1 --pos-includes "|ADJ|ADV|INTJ|PART|CONJ|CCONJ|SCONJ|NOUN|PROPN|NUM|AUX|SYM|X|ADP|DET|PRON|VERB|" --pos-paddings "|PUNCT|EOL|SPACE|" --pos-excludes "||" --to-lower --keep-symbols --keep-numerals --enable-checkpoint /data/inidun/resources/courier_article_pages.yml /mnt/wsl/data/inidun/courier/corpus/courier_issue_only_article_pages_20210921.zip /mnt/wsl/data/inidun/Courier_allpos_nolemma_tf1 Courier_allpos_nolemma_tf1
Vocab: 100%|████████████████████████████████████████████████████████████████████████████████████| 667/667 [00:10<00:00, 66.57it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████▊| 666/667 [00:32<00:00, 29.07it/s]
2022-12-23 11:59:10.104 | WARNING  | penelope.corpus.dtm.corpus:_ingest_document_index:79 - VectorizedCorpus: supplied document index has not an integral index
2022-12-23 11:59:12.851 | INFO     | __main__:process:177 - Done!
100%|███████████████████████████████████████████████████████████████████████████████████████████| 667/667 [00:36<00:00, 18.46it/s]

┌─[roger@vulcan]─[~/source/penelope]  (dev *$>)
└──>  (humlab-penelope-ATiWKElI) λ PYTHONPATH=. python ./penelope/scripts/dtm/vectorize.py --tf-threshold 10 --pos-includes "|ADJ|ADV|INTJ|PART|CONJ|CCONJ|SCONJ|NOUN|PROPN|NUM|AUX|SYM|X|ADP|DET|PRON|VERB|" --pos-paddings "|PUNCT|EOL|SPACE|" --pos-excludes "||" --to-lower --keep-symbols --keep-numerals --enable-checkpoint /data/inidun/resources/courier_article_pages.yml /mnt/wsl/data/inidun/courier/corpus/courier_issue_only_article_pages_20210921.zip /mnt/wsl/data/inidun/Courier_allpos_nolemma_tf10 Courier_allpos_nolemma_tf10
Vocab: 100%|████████████████████████████████████████████████████████████████████████████████████| 667/667 [00:10<00:00, 62.77it/s]
2022-12-23 12:01:03.859 | INFO     | penelope.corpus.token2id:compress:311 - Compressing vocab: TF-threshold=10 Keeping: * __low-tf__
100%|██████████████████████████████████████████████████████████████████████████████████████████▊| 666/667 [00:56<00:00, 14.42it/s]
2022-12-23 12:01:50.182 | ERROR    | __main__:process:180 - column index (28382) out of bounds
Traceback (most recent call last):

  File "/home/roger/source/penelope/./penelope/scripts/dtm/vectorize.py", line 186, in <module>
    main()  # pylint: disable=no-value-for-parameter
    └ <Command main>

  File "/home/roger/.cache/pypoetry/virtualenvs/humlab-penelope-ATiWKElI-py3.9/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
           │    │     │       └ {}
           │    │     └ ()
           │    └ <function BaseCommand.main at 0x7ff695456b80>
           └ <Command main>
  File "/home/roger/.cache/pypoetry/virtualenvs/humlab-penelope-ATiWKElI-py3.9/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
         │    │      └ <click.core.Context object at 0x7ff6956aed60>
         │    └ <function Command.invoke at 0x7ff69545b670>
         └ <Command main>
  File "/home/roger/.cache/pypoetry/virtualenvs/humlab-penelope-ATiWKElI-py3.9/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           │   │      │    │           │   └ {'tf_threshold': 10, 'pos_includes': '|ADJ|ADV|INTJ|PART|CONJ|CCONJ|SCONJ|NOUN|PROPN|NUM|AUX|SYM|X|ADP|DET|PRON|VERB|', 'pos_...
           │   │      │    │           └ <click.core.Context object at 0x7ff6956aed60>
           │   │      │    └ <function main at 0x7ff640636700>
           │   │      └ <Command main>
           │   └ <function Context.invoke at 0x7ff695456430>
           └ <click.core.Context object at 0x7ff6956aed60>
  File "/home/roger/.cache/pypoetry/virtualenvs/humlab-penelope-ATiWKElI-py3.9/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
                       │       └ {'tf_threshold': 10, 'pos_includes': '|ADJ|ADV|INTJ|PART|CONJ|CCONJ|SCONJ|NOUN|PROPN|NUM|AUX|SYM|X|ADP|DET|PRON|VERB|', 'pos_...
                       └ ()

  File "/home/roger/source/penelope/./penelope/scripts/dtm/vectorize.py", line 78, in main
    process(**arguments)
    │         └ {'corpus_config': '/data/inidun/resources/courier_article_pages.yml', 'input_filename': '/mnt/wsl/data/inidun/courier/corpus/...
    └ <function process at 0x7ff69588f3a0>

> File "/home/roger/source/penelope/./penelope/scripts/dtm/vectorize.py", line 175, in process
    workflow.compute(args=args, corpus_config=corpus_config)
    │        │            │                   └ CorpusConfig(corpus_name='courier_unesco', corpus_type=<CorpusType.Text: 1>, corpus_pattern='*.zip', checkpoint_opts=Checkpoi...
    │        │            └ ComputeOpts(corpus_type=<CorpusType.Text: 1>, corpus_source='/mnt/wsl/data/inidun/courier/corpus/courier_issue_only_article_p...
    │        └ <function compute at 0x7ff694c18280>
    └ <module 'penelope.workflows.vectorize.dtm' from '/home/roger/source/penelope/penelope/workflows/vectorize/dtm.py'>

  File "/home/roger/source/penelope/penelope/workflows/vectorize/dtm.py", line 48, in compute
    raise ex

  File "/home/roger/source/penelope/penelope/workflows/vectorize/dtm.py", line 30, in compute
    corpus: VectorizedCorpus = (

  File "/home/roger/source/penelope/penelope/pipeline/pipeline.py", line 140, in value
    return self.single().content
           │    └ <function CorpusPipelineBase.single at 0x7ff6406b3d30>
           └ <penelope.pipeline.pipelines.CorpusPipeline object at 0x7ff640629dc0>

  File "/home/roger/source/penelope/penelope/pipeline/pipeline.py", line 136, in single
    return next(self.resolve())
                │    └ <function CorpusPipelineBase.resolve at 0x7ff6406b38b0>
                └ <penelope.pipeline.pipelines.CorpusPipeline object at 0x7ff640629dc0>

  File "/home/roger/source/penelope/penelope/pipeline/interfaces.py", line 341, in outstream
    for payload in self.process_stream():
                   │    └ <function ToDTM.process_stream at 0x7ff6406a7b80>
                   └ ToDTM(pipeline=<penelope.pipeline.pipelines.CorpusPipeline object at 0x7ff640629dc0>, in_content_type=[<ContentType.TEXT: 2>,...

  File "/home/roger/source/penelope/penelope/pipeline/dtm/tasks.py", line 37, in process_stream
    vectorized_corpus: pc.VectorizedCorpus = vectorizer.StreamVectorizer(
                       │  │                  │          └ <class 'penelope.pipeline.dtm.vectorizer.StreamVectorizer'>
                       │  │                  └ <module 'penelope.pipeline.dtm.vectorizer' from '/home/roger/source/penelope/penelope/pipeline/dtm/vectorizer.py'>
                       │  └ <class 'penelope.corpus.dtm.corpus.VectorizedCorpus'>
                       └ <module 'penelope.corpus' from '/home/roger/source/penelope/penelope/corpus/__init__.py'>

  File "/home/roger/source/penelope/penelope/pipeline/dtm/vectorizer.py", line 45, in vectorize_stream
    vectorized_corpus: pc.VectorizedCorpus = self.from_token_id_stream(stream)
                       │  │                  │    │                    └ [(0, 0          0
                       │  │                  │    │                      1          5
                       │  │                  │    │                      2          6
                       │  │                  │    │                      3          8
                       │  │                  │    │                      4          0
                       │  │                  │    │                              ...
                       │  │                  │    │                      6636      30
                       │  │                  │    │                      6637      30
                       │  │                  │    │                      6638    2337
                       │  │                  │    │                      663...
                       │  │                  │    └ <function StreamVectorizer.from_token_id_stream at 0x7ff6406a7a60>
                       │  │                  └ <penelope.pipeline.dtm.vectorizer.StreamVectorizer object at 0x7ff640629ca0>
                       │  └ <class 'penelope.corpus.dtm.corpus.VectorizedCorpus'>
                       └ <module 'penelope.corpus' from '/home/roger/source/penelope/penelope/corpus/__init__.py'>

  File "/home/roger/source/penelope/penelope/pipeline/dtm/vectorizer.py", line 27, in from_token_id_stream
    corpus: pc.VectorizedCorpus = pc.VectorizedCorpus.from_token_id_stream(
            │  │                  │  │                └ <staticmethod object at 0x7ff6407a9430>
            │  │                  │  └ <class 'penelope.corpus.dtm.corpus.VectorizedCorpus'>
            │  │                  └ <module 'penelope.corpus' from '/home/roger/source/penelope/penelope/corpus/__init__.py'>
            │  └ <class 'penelope.corpus.dtm.corpus.VectorizedCorpus'>
            └ <module 'penelope.corpus' from '/home/roger/source/penelope/penelope/corpus/__init__.py'>

  File "/home/roger/source/penelope/penelope/corpus/dtm/corpus.py", line 464, in from_token_id_stream
    M[document_id, token_ids] = counts
    │ │            │            └ array([2202,   16,    1, ...,    1,    1,    1])
    │ │            └ array([    0,     6,     7, ..., 28480, 28481, 28482], dtype=int32)
    │ └ 22
    └ <667x28358 sparse matrix of type '<class 'numpy.int64'>'
        with 55688 stored elements in List of Lists format>

  File "/home/roger/.cache/pypoetry/virtualenvs/humlab-penelope-ATiWKElI-py3.9/lib/python3.9/site-packages/scipy/sparse/_lil.py", line 332, in __setitem__
    IndexMixin.__setitem__(self, key, x)
    │          │           │     │    └ array([2202,   16,    1, ...,    1,    1,    1])
    │          │           │     └ (22, array([    0,     6,     7, ..., 28480, 28481, 28482], dtype=int32))
    │          │           └ <667x28358 sparse matrix of type '<class 'numpy.int64'>'
    │          │                with 55688 stored elements in List of Lists format>
    │          └ <function IndexMixin.__setitem__ at 0x7ff66abb98b0>
    └ <class 'scipy.sparse._index.IndexMixin'>
  File "/home/roger/.cache/pypoetry/virtualenvs/humlab-penelope-ATiWKElI-py3.9/lib/python3.9/site-packages/scipy/sparse/_index.py", line 146, in __setitem__
    self._set_arrayXarray(i, j, x)
    │    │                │  │  └ array([2202,   16,    1, ...,    1,    1,    1])
    │    │                │  └ array([    0,     6,     7, ..., 28480, 28481, 28482], dtype=int32)
    │    │                └ array([22, 22, 22, ..., 22, 22, 22])
    │    └ <function lil_matrix._set_arrayXarray at 0x7ff66abd9700>
    └ <667x28358 sparse matrix of type '<class 'numpy.int64'>'
        with 55688 stored elements in List of Lists format>
  File "/home/roger/.cache/pypoetry/virtualenvs/humlab-penelope-ATiWKElI-py3.9/lib/python3.9/site-packages/scipy/sparse/_lil.py", line 304, in _set_arrayXarray
    _csparsetools.lil_fancy_set(self.shape[0], self.shape[1],
    │             │             │    │         │    └ <property object at 0x7ff66af70b80>
    │             │             │    │         └ <667x28358 sparse matrix of type '<class 'numpy.int64'>'
    │             │             │    │                  with 55688 stored elements in List of Lists format>
    │             │             │    └ <property object at 0x7ff66af70b80>
    │             │             └ <667x28358 sparse matrix of type '<class 'numpy.int64'>'
    │             │                     with 55688 stored elements in List of Lists format>
    │             └ <built-in function lil_fancy_set>
    └ <module 'scipy.sparse._csparsetools' from '/home/roger/.cache/pypoetry/virtualenvs/humlab-penelope-ATiWKElI-py3.9/lib/python3...
  File "scipy/sparse/_csparsetools.pyx", line 429, in _csparsetools.lil_fancy_set
  File "scipy/sparse/_csparsetools.pyx", line 798, in _csparsetools._lil_fancy_set_int64_int64
  File "scipy/sparse/_csparsetools.pyx", line 87, in _csparsetools.lil_insert

IndexError: column index (28382) out of bounds
column index (28382) out of bounds
100%|███████████████████████████████████████████████████████████████████████████████████████████| 667/667 [00:57<00:00, 11.62it/s]
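The traceback shows the assignment `M[document_id, token_ids] = counts` failing deep inside scipy, with a message that names neither the document nor the matrix width. A guard before the assignment could make the mismatch easier to diagnose. The following is a minimal sketch under stated assumptions: `check_token_ids` is a hypothetical helper, not part of penelope or scipy, and the id list is a hand-picked subset of the values visible in the traceback (the full array is elided there).

```python
def check_token_ids(document_id, token_ids, n_columns):
    """Hypothetical guard: fail early with a descriptive message when a
    token id cannot be used as a column index of the document-term matrix."""
    bad = sorted({i for i in token_ids if not 0 <= i < n_columns})
    if bad:
        raise IndexError(
            f"document {document_id}: {len(bad)} token id(s) outside "
            f"[0, {n_columns}); smallest {bad[0]}, largest {bad[-1]}"
        )


# Figures from the failing run: matrix is 667 x 28358, document id is 22.
n_columns = 28358
token_ids = [0, 6, 7, 28382, 28480, 28481, 28482]  # subset of the traceback values

try:
    check_token_ids(22, token_ids, n_columns)
except IndexError as ex:
    message = str(ex)
    print(message)
```

A check like this does not fix anything by itself, but it reports how many ids are out of range and their span, which helps decide whether the ids were never remapped or the matrix was sized from the wrong vocabulary.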