jhagara / TP-DeepSearch

Team 19 code base repository.
https://team19-16.studenti.fiit.stuba.sk

Parsing error #103

Open · mateee12 opened this issue 7 years ago

mateee12 commented 7 years ago

I ran the whole 1941 volume on the server; after a long parsing run it crashed on this file. Here is the stack trace:

{'header_config': '/var/lib/deep_search_docs_2/slovak_1336-4464/slovak_config.json', 'xml': '/var/lib/deep_search_docs_2/slovak_1336-4464/1336-4464_1941/19411205/XML/1336-4464_1941_19411205_00001.xml', 'pdf': None}

Loaded Files:

{'json': '/var/lib/deep_search_docs_2/slovak_1336-4464/slovak_config.json', 'xml': '/var/lib/deep_search_docs_2/slovak_1336-4464/1336-4464_1941/19411205/XML/1336-4464_1941_19411205_00001.xml', 'journal_marc21': '/var/lib/deep_search_docs_2/slovak_1336-4464/journal_marc21.xml', 'dir': '/var/lib/deep_search_docs_2/slovak_1336-4464/1336-4464_1941/19411205'}

Issue created, index: deep_search_prod, type: issue, id: AVt-daLNADYbJb4KNhEx

Traceback (most recent call last):
  File "elastic_filler.py", line 72, in <module>
    main(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4])
  File "elastic_filler.py", line 65, in main
    issue_id = semantic.save_to_elastic(name, file['dir'], file)
  File "/var/www/deep_search/python_app/helper/elastic_filler.py", line 94, in save_to_elastic
    max_font = max([int(head[:-1]) for head in heading_sizes] or [0])
  File "/var/www/deep_search/python_app/helper/elastic_filler.py", line 94, in <listcomp>
    max_font = max([int(head[:-1]) for head in heading_sizes] or [0])
ValueError: invalid literal for int() with base 10: '10.'

FloofyReal commented 7 years ago

It's trying to convert the page number to an int, and head[:-1] is evidently not enough to extract the page-size integer, i.e. the OCR is not as consistent as we thought :D http://stackoverflow.com/questions/6903557/splitting-on-first-occurrence

Solution: try int(head.split('.', 1)[1]) in place of int(head[:-1]) and report back.
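
For context, a minimal sketch of a more defensive parser (the helper name parse_heading_size and the sample values below are hypothetical; the exact format of the heading_sizes entries is not shown in this issue):

    import re

    def parse_heading_size(head):
        # Take the leading run of digits from an OCR heading-size string.
        # Values such as '10.5' (where head[:-1] yields '10.') no longer
        # crash int(); strings with no leading digits fall back to 0.
        match = re.match(r'\d+', head.strip())
        return int(match.group()) if match else 0

    # Hypothetical usage mirroring line 94 of elastic_filler.py:
    heading_sizes = ['12', '10.5', '9.']   # example values only
    max_font = max([parse_heading_size(h) for h in heading_sizes] or [0])
    print(max_font)                        # -> 12

Unlike the split-based one-liner, this also handles entries that contain no dot at all, where split('.', 1)[1] would raise an IndexError.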

mateee12 commented 7 years ago

Yeah, that's exactly the bug. Of course, bugs like this only turn up in production when you're parsing a large volume of data. Adam, please open a pull request fixing just this one thing; I'll approve it right away and kick off a new parsing run. Thanks.