mikeizbicki / cmc-csci143

big data course materials

Hint: Task 2b #212

Closed afroCoderHanane closed 2 years ago

afroCoderHanane commented 2 years ago

In office hours we all hit the GitHub error while trying downloader_host.py. I believe the error may be linked to GitHub, but I was able to bypass it by

 Note that the command above lists the hosts in key syntax form, and you'll have to convert the host into standard form.

The mistake I made before was using cnn instead of cnn.com. Hopefully that helps.
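For anyone unsure what converting from key syntax form to standard form looks like, here is a minimal sketch. It assumes the key form is what the rollup table in this thread shows: host labels reversed, comma-separated, and closed with a `)` (for example `com,google,developers)`). The function name `host_key_to_standard` is hypothetical and not part of the course code.

```python
def host_key_to_standard(host_key: str) -> str:
    """Convert a key-form host such as 'com,cnn)' into 'cnn.com'.

    Assumes the key form is the reversed, comma-separated list of
    host labels followed by a closing ')'.
    """
    # Drop surrounding whitespace and the trailing ')' that ends the host part.
    labels = host_key.strip().rstrip(")").split(",")
    # Labels are stored from most to least significant, so reverse them.
    return ".".join(reversed(labels))


print(host_key_to_standard("com,cnn)"))                # cnn.com
print(host_key_to_standard("com,google,developers)"))  # developers.google.com
```

The result is the full host name (cnn.com, not just cnn), which is the form that fixed my mistake above.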

ohorban commented 2 years ago

Just a quick follow-up. How do I know if task 2b worked? My Docker logs look something like:

2022-05-07 07:34:59:INFO:pspacy:valid_langs=['af', 'ar', 'bg', 'bn', 'ca', 'cs', 'da', 'de', 'el', 'en', 'es', 'et', 'eu', 'fa', 'fi', 'fr', 'ga', 'gu', 'he', 'hi', 'hr', 'hu', 'hy', 'id', 'is', 'it', 'ja', 'kn', 'ko', 'lb', 'lij', 'lt', 'lv', 'ml', 'mr', 'nb', 'ne', 'nl', 'pl', 'pt', 'ro', 'ru', 'si', 'sk', 'sl', 'sq', 'sr', 'sv', 'ta', 'te', 'th', 'tl', 'tr', 'tt', 'uk', 'ur', 'vi', 'xx', 'yo', 'zh']
2022-05-07 07:34:59:INFO:pspacy:initializing xx
2022-05-07 07:34:59:INFO:cdx_toolkit.commoncrawl:Found 87 endpoints in the Common Crawl index
2022-05-07 07:34:59:INFO:__main__:name=process_cdx_url(url="usatoday.com/*", source="cc", **kwargs={})
2022-05-07 07:34:59:WARNING:root:skipping name=process_cdx_url(url="usatoday.com/*", source="cc", **kwargs={})

But I don't see any new data being added, and when I run select * from metahtml_rollup_host2 order by hostpath desc limit 10; my top hosts look pretty random:

 url | hostpathquery | hostpath |           host
-----+---------------+----------+---------------------------
  86 |            86 |       86 | ru,yandex,rasp)
  35 |            35 |       35 | it,virgilio)
  32 |            32 |       32 | com,stitcher)
  29 |            29 |       26 | gov,nih,nlm,ncbi,pubmed)
  24 |            24 |       24 | org,wikipedia,eu)
  24 |            24 |       24 | com,cengage,community)
  23 |            23 |       23 | com,pinterest,br)
  23 |            23 |       23 | com,google,developers)
  22 |            22 |       22 | com,apple,music)
  21 |            21 |       21 | com,avid,community-azure)

afroCoderHanane commented 2 years ago

Oops. For me, at some point the top host changed to google.com, but my hostpath and hostpathquery counts are still small. cnn.com ran for about two hours before I lost my connection, and the log was showing 403 connection errors. I believe they are linked to my downloader_warc.py, which probably still needs some FIXME work, so your problem might be there too!