tanmaykm / CommonCrawl.jl

Interface to the Common Crawl dataset on Amazon S3
MIT License

Blog post on CommonCrawl #2

Open ViralBShah opened 10 years ago

ViralBShah commented 10 years ago

@johnmyleswhite, @tanmaykm and I have been discussing a blog post on indexing, as a way to show off Julia's capabilities for working with large datasets in parallel. This started with HW2 in our MIT class.

Indexing the entire CommonCrawl corpus on EC2 is very expensive, and recent profiling showed that much of the time goes into string operations. What else could be done over the entire corpus that takes less compute time but still shows off these capabilities?
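
For concreteness, here is a minimal sketch of the per-archive parallel pattern being discussed. `process_archive` and the archive list are hypothetical placeholders; the actual indexing code is not in this package:

```julia
# Minimal sketch of parallel per-archive processing, not the actual indexing
# code. `process_archive` is a hypothetical stand-in for whatever work
# (indexing, tokenizing, counting) is done on one archive file.
using Distributed
addprocs(4)  # one worker per core; size this to the EC2 instance

@everywhere function process_archive(url::AbstractString)
    # fetch/stream one archive from S3 and do the per-archive work here;
    # return whatever per-archive summary the task needs
    return (url = url, entries = 0)  # placeholder result
end

# placeholder archive list; real paths come from the Common Crawl manifests
archive_urls = ["s3://commoncrawl/segment-$i.warc.gz" for i in 1:8]

# pmap hands one archive at a time to each worker, so throughput scales
# with the number of processes
results = pmap(process_archive, archive_urls)
println("processed $(length(results)) archives")
```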

@tanmaykm has a cost spreadsheet, for which I do not have the link handy.

ViralBShah commented 10 years ago

Cc: @jeffbezanson @stefankarpinski

johnmyleswhite commented 10 years ago

Kind of off-topic, but I'd be a big fan of using the CommonCrawl examples as a testbed for developing more powerful string-processing tools in Julia. As more people with topic-modeling experience take an interest in Julia, the most frequent complaint I hear is that our text-analysis toolkit is far weaker than it would need to be for Julia to compete with Java.

randyzwitch commented 8 years ago

I realize this is a really old issue, but what constitutes "cost is really high"? I have a friend who was asking about this dataset, and I was going to use the CC dataset to look at marketing tag prevalence/market share. But not if it's going to cost me $1000...

Also, do you have code that does the parallelizing? The MIT link is dead now.

tanmaykm commented 8 years ago

@randyzwitch the cost being discussed was purely for EC2 compute hours.

Metadata processing took about 10 seconds per archive, which put the estimate for 2M archives at under $500. Indexing the content (code not in this package) took about 60 seconds per archive and was estimated at $2.5K for all archives. Spot instances would be 50-70% cheaper.

Since it largely depends on the nature of processing, I think the best way to assess cost would be to try processing one archive file and extrapolate from there.
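
As a back-of-the-envelope illustration of that extrapolation, using the timings above (the hourly rate is an assumed placeholder, not a quoted price):

```julia
# Extrapolate corpus-wide cost from a single-archive timing.
# hourly_rate is an assumed EC2 on-demand price; substitute your own quote.
secs_per_archive = 10.0        # measured: metadata processing per archive
n_archives       = 2_000_000   # approximate corpus size discussed above
hourly_rate      = 0.10        # hypothetical $/instance-hour

compute_hours = secs_per_archive * n_archives / 3600
on_demand     = compute_hours * hourly_rate
spot          = on_demand * 0.4   # spot quoted as 50-70% cheaper; using 60%

println("~$(round(Int, compute_hours)) compute-hours")
println("~\$$(round(on_demand; digits = 2)) on-demand, ~\$$(round(spot; digits = 2)) on spot")
```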

randyzwitch commented 8 years ago

Thanks @tanmaykm. Looks like I shouldn't mess around on the full dataset until I'm really sure what I'm doing. Or, maybe I can propose a project at work and use their hardware.