Open roger-mahler opened 1 year ago
Check compress of vocabulary!
┌─[roger@vulcan]─[~/source/penelope] (dev *$>)
└──> (humlab-penelope-ATiWKElI) λ PYTHONPATH=. python ./penelope/scripts/dtm/vectorize.py --tf-threshold 1 --pos-includes "|ADJ|ADV|INTJ|PART|CONJ|CCONJ|SCONJ|NOUN|PROPN|NUM|AUX|SYM|X|ADP|DET|PRON|VERB|" --pos-paddings "|PUNCT|EOL|SPACE|" --pos-excludes "||" --to-lower --keep-symbols --keep-numerals --enable-checkpoint /data/inidun/resources/courier_article_pages.yml /mnt/wsl/data/inidun/courier/corpus/courier_issue_only_article_pages_20210921.zip /mnt/wsl/data/inidun/Courier_allpos_nolemma_tf1 Courier_allpos_nolemma_tf1
Vocab: 100%|████████████████████████████████████████████████████████████████████████████████████| 667/667 [00:10<00:00, 66.57it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████▊| 666/667 [00:32<00:00, 29.07it/s]2022-12-23 11:59:10.104 | WARNING | penelope.corpus.dtm.corpus:_ingest_document_index:79 - VectorizedCorpus: supplied document index has not an integral index
2022-12-23 11:59:12.851 | INFO | __main__:process:177 - Done!
100%|███████████████████████████████████████████████████████████████████████████████████████████| 667/667 [00:36<00:00, 18.46it/s]

┌─[roger@vulcan]─[~/source/penelope] (dev *$>)
└──> (humlab-penelope-ATiWKElI) λ PYTHONPATH=. python ./penelope/scripts/dtm/vectorize.py --tf-threshold 10 --pos-includes "|ADJ|ADV|INTJ|PART|CONJ|CCONJ|SCONJ|NOUN|PROPN|NUM|AUX|SYM|X|ADP|DET|PRON|VERB|" --pos-paddings "|PUNCT|EOL|SPACE|" --pos-excludes "||" --to-lower --keep-symbols --keep-numerals --enable-checkpoint /data/inidun/resources/courier_article_pages.yml /mnt/wsl/data/inidun/courier/corpus/courier_issue_only_article_pages_20210921.zip /mnt/wsl/data/inidun/Courier_allpos_nolemma_tf10 Courier_allpos_nolemma_tf10
Vocab: 100%|████████████████████████████████████████████████████████████████████████████████████| 667/667 [00:10<00:00, 62.77it/s]
2022-12-23 12:01:03.859 | INFO | penelope.corpus.token2id:compress:311 - Compressing vocab: TF-threshold=10 Keeping: * __low-tf__
100%|██████████████████████████████████████████████████████████████████████████████████████████▊| 666/667 [00:56<00:00, 14.42it/s]2022-12-23 12:01:50.182 | ERROR | __main__:process:180 - column index (28382) out of bounds
Traceback (most recent call last):
  File "/home/roger/source/penelope/./penelope/scripts/dtm/vectorize.py", line 186, in <module>
    main() # pylint: disable=no-value-for-parameter
    └ <Command main>
  File "/home/roger/.cache/pypoetry/virtualenvs/humlab-penelope-ATiWKElI-py3.9/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
    │ │ │ └ {}
    │ │ └ ()
    │ └ <function BaseCommand.main at 0x7ff695456b80>
    └ <Command main>
  File "/home/roger/.cache/pypoetry/virtualenvs/humlab-penelope-ATiWKElI-py3.9/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
    │ │ └ <click.core.Context object at 0x7ff6956aed60>
    │ └ <function Command.invoke at 0x7ff69545b670>
    └ <Command main>
  File "/home/roger/.cache/pypoetry/virtualenvs/humlab-penelope-ATiWKElI-py3.9/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
    │ │ │ │ │ └ {'tf_threshold': 10, 'pos_includes': '|ADJ|ADV|INTJ|PART|CONJ|CCONJ|SCONJ|NOUN|PROPN|NUM|AUX|SYM|X|ADP|DET|PRON|VERB|', 'pos_...
    │ │ │ │ └ <click.core.Context object at 0x7ff6956aed60>
    │ │ │ └ <function main at 0x7ff640636700>
    │ │ └ <Command main>
    │ └ <function Context.invoke at 0x7ff695456430>
    └ <click.core.Context object at 0x7ff6956aed60>
  File "/home/roger/.cache/pypoetry/virtualenvs/humlab-penelope-ATiWKElI-py3.9/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
    │ └ {'tf_threshold': 10, 'pos_includes': '|ADJ|ADV|INTJ|PART|CONJ|CCONJ|SCONJ|NOUN|PROPN|NUM|AUX|SYM|X|ADP|DET|PRON|VERB|', 'pos_...
    └ ()
  File "/home/roger/source/penelope/./penelope/scripts/dtm/vectorize.py", line 78, in main
    process(**arguments)
    │ └ {'corpus_config': '/data/inidun/resources/courier_article_pages.yml', 'input_filename': '/mnt/wsl/data/inidun/courier/corpus/...
    └ <function process at 0x7ff69588f3a0>
> File "/home/roger/source/penelope/./penelope/scripts/dtm/vectorize.py", line 175, in process
    workflow.compute(args=args, corpus_config=corpus_config)
    │ │ │ └ CorpusConfig(corpus_name='courier_unesco', corpus_type=<CorpusType.Text: 1>, corpus_pattern='*.zip', checkpoint_opts=Checkpoi...
    │ │ └ ComputeOpts(corpus_type=<CorpusType.Text: 1>, corpus_source='/mnt/wsl/data/inidun/courier/corpus/courier_issue_only_article_p...
    │ └ <function compute at 0x7ff694c18280>
    └ <module 'penelope.workflows.vectorize.dtm' from '/home/roger/source/penelope/penelope/workflows/vectorize/dtm.py'>
  File "/home/roger/source/penelope/penelope/workflows/vectorize/dtm.py", line 48, in compute
    raise ex
  File "/home/roger/source/penelope/penelope/workflows/vectorize/dtm.py", line 30, in compute
    corpus: VectorizedCorpus = (
  File "/home/roger/source/penelope/penelope/pipeline/pipeline.py", line 140, in value
    return self.single().content
    │ └ <function CorpusPipelineBase.single at 0x7ff6406b3d30>
    └ <penelope.pipeline.pipelines.CorpusPipeline object at 0x7ff640629dc0>
  File "/home/roger/source/penelope/penelope/pipeline/pipeline.py", line 136, in single
    return next(self.resolve())
    │ └ <function CorpusPipelineBase.resolve at 0x7ff6406b38b0>
    └ <penelope.pipeline.pipelines.CorpusPipeline object at 0x7ff640629dc0>
  File "/home/roger/source/penelope/penelope/pipeline/interfaces.py", line 341, in outstream
    for payload in self.process_stream():
    │ └ <function ToDTM.process_stream at 0x7ff6406a7b80>
    └ ToDTM(pipeline=<penelope.pipeline.pipelines.CorpusPipeline object at 0x7ff640629dc0>, in_content_type=[<ContentType.TEXT: 2>,...
  File "/home/roger/source/penelope/penelope/pipeline/dtm/tasks.py", line 37, in process_stream
    vectorized_corpus: pc.VectorizedCorpus = vectorizer.StreamVectorizer(
    │ │ │ └ <class 'penelope.pipeline.dtm.vectorizer.StreamVectorizer'>
    │ │ └ <module 'penelope.pipeline.dtm.vectorizer' from '/home/roger/source/penelope/penelope/pipeline/dtm/vectorizer.py'>
    │ └ <class 'penelope.corpus.dtm.corpus.VectorizedCorpus'>
    └ <module 'penelope.corpus' from '/home/roger/source/penelope/penelope/corpus/__init__.py'>
  File "/home/roger/source/penelope/penelope/pipeline/dtm/vectorizer.py", line 45, in vectorize_stream
    vectorized_corpus: pc.VectorizedCorpus = self.from_token_id_stream(stream)
    │ │ │ │ └ [(0, 0 0
    │ │ │ │ 1 5
    │ │ │ │ 2 6
    │ │ │ │ 3 8
    │ │ │ │ 4 0
    │ │ │ │ ...
    │ │ │ │ 6636 30
    │ │ │ │ 6637 30
    │ │ │ │ 6638 2337
    │ │ │ │ 663...
    │ │ │ └ <function StreamVectorizer.from_token_id_stream at 0x7ff6406a7a60>
    │ │ └ <penelope.pipeline.dtm.vectorizer.StreamVectorizer object at 0x7ff640629ca0>
    │ └ <class 'penelope.corpus.dtm.corpus.VectorizedCorpus'>
    └ <module 'penelope.corpus' from '/home/roger/source/penelope/penelope/corpus/__init__.py'>
  File "/home/roger/source/penelope/penelope/pipeline/dtm/vectorizer.py", line 27, in from_token_id_stream
    corpus: pc.VectorizedCorpus = pc.VectorizedCorpus.from_token_id_stream(
    │ │ │ │ └ <staticmethod object at 0x7ff6407a9430>
    │ │ │ └ <class 'penelope.corpus.dtm.corpus.VectorizedCorpus'>
    │ │ └ <module 'penelope.corpus' from '/home/roger/source/penelope/penelope/corpus/__init__.py'>
    │ └ <class 'penelope.corpus.dtm.corpus.VectorizedCorpus'>
    └ <module 'penelope.corpus' from '/home/roger/source/penelope/penelope/corpus/__init__.py'>
  File "/home/roger/source/penelope/penelope/corpus/dtm/corpus.py", line 464, in from_token_id_stream
    M[document_id, token_ids] = counts
    │ │ │ └ array([2202, 16, 1, ..., 1, 1, 1])
    │ │ └ array([ 0, 6, 7, ..., 28480, 28481, 28482], dtype=int32)
    │ └ 22
    └ <667x28358 sparse matrix of type '<class 'numpy.int64'>' with 55688 stored elements in List of Lists format>
  File "/home/roger/.cache/pypoetry/virtualenvs/humlab-penelope-ATiWKElI-py3.9/lib/python3.9/site-packages/scipy/sparse/_lil.py", line 332, in __setitem__
    IndexMixin.__setitem__(self, key, x)
    │ │ │ │ └ array([2202, 16, 1, ..., 1, 1, 1])
    │ │ │ └ (22, array([ 0, 6, 7, ..., 28480, 28481, 28482], dtype=int32))
    │ │ └ <667x28358 sparse matrix of type '<class 'numpy.int64'>'
    │ │ with 55688 stored elements in List of Lists format>
    │ └ <function IndexMixin.__setitem__ at 0x7ff66abb98b0>
    └ <class 'scipy.sparse._index.IndexMixin'>
  File "/home/roger/.cache/pypoetry/virtualenvs/humlab-penelope-ATiWKElI-py3.9/lib/python3.9/site-packages/scipy/sparse/_index.py", line 146, in __setitem__
    self._set_arrayXarray(i, j, x)
    │ │ │ │ └ array([2202, 16, 1, ..., 1, 1, 1])
    │ │ │ └ array([ 0, 6, 7, ..., 28480, 28481, 28482], dtype=int32)
    │ │ └ array([22, 22, 22, ..., 22, 22, 22])
    │ └ <function lil_matrix._set_arrayXarray at 0x7ff66abd9700>
    └ <667x28358 sparse matrix of type '<class 'numpy.int64'>' with 55688 stored elements in List of Lists format>
  File "/home/roger/.cache/pypoetry/virtualenvs/humlab-penelope-ATiWKElI-py3.9/lib/python3.9/site-packages/scipy/sparse/_lil.py", line 304, in _set_arrayXarray
    _csparsetools.lil_fancy_set(self.shape[0], self.shape[1],
    │ │ │ │ │ └ <property object at 0x7ff66af70b80>
    │ │ │ │ └ <667x28358 sparse matrix of type '<class 'numpy.int64'>'
    │ │ │ │ with 55688 stored elements in List of Lists format>
    │ │ │ └ <property object at 0x7ff66af70b80>
    │ │ └ <667x28358 sparse matrix of type '<class 'numpy.int64'>'
    │ │ with 55688 stored elements in List of Lists format>
    │ └ <built-in function lil_fancy_set>
    └ <module 'scipy.sparse._csparsetools' from '/home/roger/.cache/pypoetry/virtualenvs/humlab-penelope-ATiWKElI-py3.9/lib/python3...
  File "scipy/sparse/_csparsetools.pyx", line 429, in _csparsetools.lil_fancy_set
  File "scipy/sparse/_csparsetools.pyx", line 798, in _csparsetools._lil_fancy_set_int64_int64
  File "scipy/sparse/_csparsetools.pyx", line 87, in _csparsetools.lil_insert
IndexError: column index (28382) out of bounds
column index (28382) out of bounds
100%|███████████████████████████████████████████████████████████████████████████████████████████| 667/667 [00:57<00:00, 11.62it/s]
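For reference when checking: in the failing run the matrix is 667x28358, yet the `token_ids` being assigned go up to 28482, and the first offending column is 28382. One possible reading (an assumption, not confirmed from the code) is that the vocabulary was compressed at TF-threshold=10 but the token ids in the stream still refer to the uncompressed vocabulary. The sketch below illustrates that kind of mismatch in plain Python. The names `compress` and `remap` and the toy data are hypothetical and are not penelope's actual API; only the `__low-tf__` mask token name is taken from the log.

```python
from collections import Counter

LOW_TF = "__low-tf__"  # mask token name as seen in the log; its handling here is assumed


def compress(token2id, tf, threshold):
    """Hypothetical helper: keep tokens with tf >= threshold, fold the rest into LOW_TF.

    Returns the new token->id mapping and an old_id -> new_id translation table.
    """
    new_token2id = {LOW_TF: 0}
    translation = {}
    for token, old_id in sorted(token2id.items(), key=lambda kv: kv[1]):
        if tf[old_id] >= threshold:
            new_id = len(new_token2id)
            new_token2id[token] = new_id
        else:
            new_id = new_token2id[LOW_TF]
        translation[old_id] = new_id
    return new_token2id, translation


def remap(doc_counts, translation):
    """Translate {old_id: count} to {new_id: count}, summing counts that collide."""
    out = Counter()
    for old_id, n in doc_counts.items():
        out[translation[old_id]] += n
    return dict(out)


# Toy vocabulary and corpus-wide term frequencies (made-up data).
token2id = {"the": 0, "zyx": 1, "qwv": 2, "unesco": 3, "courier": 4}
tf = {0: 50, 1: 1, 2: 2, 3: 12, 4: 30}

new_token2id, translation = compress(token2id, tf, threshold=10)
n_cols = len(new_token2id)  # width of a matrix sized from the compressed vocabulary

doc = {0: 5, 1: 1, 4: 2}  # one document, still expressed in OLD token ids

# Old ids that would fall outside the compressed matrix if used directly as columns.
stale = [i for i in doc if i >= n_cols]

# After translation every id fits within the compressed vocabulary.
remapped = remap(doc, translation)
```

If this is what happens, the thing to verify is that every token id reaching `M[document_id, token_ids] = counts` has been translated to the compressed vocabulary before the matrix is filled.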