trivio / common_crawl_index

Index URLs in Common Crawl

How can I get a file in text mode? #26

Open alibezz opened 9 years ago

alibezz commented 9 years ago

I am trying to copy a file in text mode, but it is not working. The URL is com.wordpress.alinebessa/2011/06/11/documenting-accerciser-first-impressions/:http, which exists in Common Crawl. When I check it out here: http://urlsearch.commoncrawl.org/page/1346876860454/1346973204444/3513/41986721/13163

It loads correctly there, but not when I try to fetch it in remote_copy (method copy_arc_files) by calling:

    if src_key:
        print src_key.get_contents_as_string(headers=headers, encoding="iso-8859-1")

It comes back to me as bytes. Can you folks please help me in retrieving the actual text? Thanks!

alibezz commented 9 years ago

Hi all,

It turns out I solved it myself: the bytes were arriving compressed, so I just decompressed them.

Cheers,
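
For anyone landing here with the same problem, below is a minimal sketch of the decompression step described above. It is not the code from this thread. It assumes the fetched bytes are gzip-compressed and that they are fetched undecoded (that is, without passing an encoding to get_contents_as_string). The helper name decompress_record and the sample payload are hypothetical, and the sketch is written for Python 3 rather than the Python 2 used in the question.

```python
import gzip
import zlib


def decompress_record(raw_bytes, encoding="iso-8859-1"):
    """Decompress one gzip-compressed chunk and decode it to text.

    Hypothetical helper: assumes raw_bytes starts with a gzip stream.
    """
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    data = decompressor.decompress(raw_bytes)
    data += decompressor.flush()
    # Anything after the first gzip stream is left in decompressor.unused_data.
    return data.decode(encoding)


# Stand-in for the bytes that the S3 fetch would return (made-up content).
sample = gzip.compress(
    "HTTP/1.1 200 OK\r\n\r\n<html>ol\u00e1</html>".encode("iso-8859-1")
)
text = decompress_record(sample)
```

Using a decompression object instead of a one-shot call means trailing bytes after the first compressed stream do not cause an error, which may matter if the requested byte range is slightly larger than the record.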