Closed afroCoderHanane closed 2 years ago
Just a quick follow-up: how do I know if task 2b worked? My Docker logs look something like this:
2022-05-07 07:34:59:INFO:pspacy:valid_langs=['af', 'ar', 'bg', 'bn', 'ca', 'cs', 'da', 'de', 'el', 'en', 'es', 'et', 'eu', 'fa', 'fi', 'fr', 'ga', 'gu', 'he', 'hi', 'hr', 'hu', 'hy', 'id', 'is', 'it', 'ja', 'kn', 'ko', 'lb', 'lij', 'lt', 'lv', 'ml', 'mr', 'nb', 'ne', 'nl', 'pl', 'pt', 'ro', 'ru', 'si', 'sk', 'sl', 'sq', 'sr', 'sv', 'ta', 'te', 'th', 'tl', 'tr', 'tt', 'uk', 'ur', 'vi', 'xx', 'yo', 'zh']
2022-05-07 07:34:59:INFO:pspacy:initializing xx
2022-05-07 07:34:59:INFO:cdx_toolkit.commoncrawl:Found 87 endpoints in the Common Crawl index
2022-05-07 07:34:59:INFO:__main__:name=process_cdx_url(url="usatoday.com/*", source="cc", **kwargs={})
2022-05-07 07:34:59:WARNING:root:skipping name=process_cdx_url(url="usatoday.com/*", source="cc", **kwargs={})
But I don't see any new data being added. When I run select * from metahtml_rollup_host2 order by hostpath desc limit 10; my top hosts look fairly random:
 url | hostpathquery | hostpath |           host
-----+---------------+----------+---------------------------
  86 |            86 |       86 | ru,yandex,rasp)
  35 |            35 |       35 | it,virgilio)
  32 |            32 |       32 | com,stitcher)
  29 |            29 |       26 | gov,nih,nlm,ncbi,pubmed)
  24 |            24 |       24 | org,wikipedia,eu)
  24 |            24 |       24 | com,cengage,community)
  23 |            23 |       23 | com,pinterest,br)
  23 |            23 |       23 | com,google,developers)
  22 |            22 |       22 | com,apple,music)
  21 |            21 |       21 | com,avid,community-azure)
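Sorting by count only shows whichever hosts happen to be largest, so it may be easier to look up the host you just queued. Judging from the output above, the host column appears to store the hostname with its labels reversed, joined by commas, and followed by a closing parenthesis (for example com,google,developers) for developers.google.com). The sketch below is based only on that observation. The helper names host_key and host_count_query are hypothetical and not part of the course code, and the real key format may differ (for instance in how www. is handled).

```python
def host_key(hostname: str) -> str:
    """Reverse the dot-separated labels, e.g. 'developers.google.com' -> 'com,google,developers)'.

    Assumption: this mirrors the format seen in the host column of
    metahtml_rollup_host2; it is inferred from the query output, not from the schema.
    """
    labels = hostname.strip().lower().rstrip(".").split(".")
    return ",".join(reversed(labels)) + ")"


def host_count_query(hostname: str) -> str:
    """Build a query matching the host itself and any of its subdomains."""
    key = host_key(hostname).replace("'", "''")  # escape quotes for pasting into psql
    prefix = key[:-1] + ","                      # 'com,usatoday)' -> 'com,usatoday,'
    return (
        "select * from metahtml_rollup_host2 "
        f"where host = '{key}' or host like '{prefix}%';"
    )


if __name__ == "__main__":
    print(host_count_query("usatoday.com"))
```

If the printed query returns no rows after the job has run for a while, that would suggest nothing was inserted for that host. The WARNING line containing "skipping" in the log is probably worth looking into in that case.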
Oops, I think at some point my top host changed to google.com, but my hostpath and hostquery counts are still low. cnn.com ran for about two hours before I lost my connection, and in the log I was getting a 403 connection error. I believe it is linked to my downloader_warc.py, which probably still needs some FIXME work, so your problem might be there too!
In office hours we all hit the GitHub error while trying downloader_host.py. I believe the error may be linked to GitHub, but I was able to bypass it by
The mistake I made before was using cnn instead of cnn.com. Hopefully that helps.