jsoma / data-studio-projects

[Project] Drug markets on the dark web #260

Open angelareplica opened 6 years ago

angelareplica commented 6 years ago

Please complete all of the following sections, or the ghost of Joseph Pulitzer will spookily dance around your issue! A completed version of this template can be found at https://github.com/jsoma/data-studio-projects/issues/1

Pitch

Summary

I'm interested in what drug supply/demand looks like on the dark web. Gwern.net publishes a comprehensive archive of dark net market scrapes; the downsides are that it's 1.5 TB uncompressed, it's all in HTML files (hundreds per day), and the data ends in 2015. (It took my laptop a full day to download everything -- 52 GB compressed -- and to extract data for a single small marketplace.) I also found a couple of datasets from 2014-2015 and 2016, with scrapes from Agora, Hansa, and Valhalla, which were popular markets that are now defunct. In addition, I'm poking through an SQL database from Carnegie Mellon, which has anonymized data for some markets through 2017. I had hoped to scrape the largest current market, but the interface was terrible.

The Economist did exactly what I had hoped to visualize with Gwern's data. I'd like to try to mimic their format, or to come up with other ways to visualize similar data. I'll most likely stick with the datasets that I found, and/or the CMU SQL database, since Gwern's data is probably too hefty for a n00b like me to parse in the time allotted.

economist-darknet

Details

Possible headline(s):
What the Drug Trade Looks Like On The Dark Web

Data set(s):
Agora data from 2014-2015: https://www.kaggle.com/philipjames11/dark-net-marketplace-drug-data-agora-20142015
Sarah Jamie Lewis's Dark Web Data Dumps (2016): https://polecat.mascherari.press/onionscan/dark-web-data-dumps
CMU Database: https://arima.cylab.cmu.edu/

Code repository: https://github.com/angelareplica/data-studio/tree/master/code/05-dnm-drugs

Possible problems/fears/questions: The data isn't really up-to-date, and other people basically did all the hard work (scraping) for me. I'll try to dig into some 2017 data from the CMU database.

Work so far

I'm having some issues reading some of my CSVs in pandas, but I'll look into that later. So far I've just looked at the Agora dataset. Also, my current charts look terrible! I just threw them together. I'll clean up the data and the visuals for my next update.

drugs

Psychedelics: drugs_psych

Checklist

This checklist must be completed before you submit your draft.

angelareplica commented 6 years ago

Update 1:

Update

Your project content: images/words/etc

A little closer to aping the Economist.

Sales on Alphabay, the largest market by sales volume (defunct as of 2017): alphabay_sales_edited

I plan on making two more charts like the above, for the next two largest markets. Then I hope to combine the 3 and make a chart that looks more similar (circles/bubbles) to the example I put in my original post.

Any changes in direction or topic?

I've decided to focus exclusively on working with the CMU data. I'd like to chart sales over the 3 largest markets by volume of total sales (Alphabay, Silk Road 2, and Evolution).
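Picking the three largest markets by total sales could be sketched roughly as below. This is only an illustration in plain Python: the row layout (market, category, number of sales) and every number are made-up placeholders, not the CMU schema or real figures, and `top_markets` is a hypothetical helper name.

```python
from collections import Counter

# Placeholder rows: (market, category, number of sales).
# The layout and all numbers are invented for illustration only.
rows = [
    ("Market A", "Cannabis", 120),
    ("Market A", "Psychedelics", 40),
    ("Market B", "Cannabis", 90),
    ("Market B", "Psychedelics", 10),
    ("Market C", "Cannabis", 30),
    ("Market D", "Cannabis", 5),
]


def top_markets(rows, n=3):
    """Return the n market names with the highest total sales, largest first."""
    totals = Counter()
    for market, _category, sales in rows:
        totals[market] += sales
    return [market for market, _total in totals.most_common(n)]


largest = top_markets(rows)
# Keep only the rows that belong to those markets before charting.
selected = [row for row in rows if row[0] in largest]
```

With real data, `selected` would then feed one chart per market plus the combined chart.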

Problems/Questions

CMU's public datasets are anonymized and a bit limited, but I've requested access to their complete databases through IMPACT. Unfortunately it takes up to 2 weeks for IMPACT to approve my account, which will be after the due date of this project! I'll have to stick with making charts like the above.

sarahslo commented 6 years ago

well done! nice colors. here's my q for you: how did you organize the data in the doughnut chart? it's not alphabetically, it's not by data, most to least. you have these beautiful colors and they separate well but, tell me a story? Can you put a certain class of drugs together, make them one color and let me see which group is most popular? Uppers vs downers?

it does reveal what is sold and that's interesting, but is there one thing i should take away here? can you design this so i see that?
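One way the grouping idea above might be sketched: map each dataset category to a broader class, then sum sales per class. Everything named here is an assumption for illustration. The category names, the "Uppers"/"Downers" labels, the `sales_by_class` helper, and the counts are hypothetical, and the real datasets may not split categories this way.

```python
from collections import defaultdict

# Hypothetical category -> class mapping. Both the category names and the
# "Uppers"/"Downers" labels are assumptions; adjust them to the real dataset.
CLASS_OF = {
    "Stimulants": "Uppers",
    "Ecstasy": "Uppers",
    "Opioids": "Downers",
    "Benzos": "Downers",
    "Cannabis": "Cannabis",
    "Psychedelics": "Psychedelics",
}


def sales_by_class(category_sales, class_of=CLASS_OF, other="Other"):
    """Sum sales per class and return (class, total) pairs, largest first."""
    totals = defaultdict(int)
    for category, sales in category_sales.items():
        # Categories missing from the mapping fall into the `other` bucket.
        totals[class_of.get(category, other)] += sales
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


# Made-up counts, for illustration only.
example = {
    "Cannabis": 50,
    "Stimulants": 30,
    "Ecstasy": 25,
    "Opioids": 20,
    "Digital Goods": 5,
}
grouped = sales_by_class(example)
```

Giving each class a single color would then let the chart answer "which group is most popular?" at a glance.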

angelareplica commented 6 years ago

Updated my charts and re-ordered from largest to smallest. The feedback above is very helpful (thanks Sarah!), and I'd like to figure out a way to better group the drugs. In the datasets, this is as far as they're broken down, so unfortunately I can't see specific drugs. Perhaps I could remove the non-drug categories altogether to get a better picture of what drugs have been popular over time, or on which markets.

Update

Your project content: images/words/etc

alphabay_sales_edited2

evolution_sales_edited

silkroad2_sales_edited

Any changes in direction or topic?

Nope.

Problems/Questions

Detailed above.

jsoma commented 6 years ago

Jeez, yeah, those colors went from frown city to being real real nice. And the linking of data section to data words by use of color, so solid!

I'm going to go against the feedback given so far and say largest->smallest is good in general, but in this case I'd love to see how things changed across different markets, so I'd want Cannabis to be the same color in the same general area each time, if only to allow me to draw (terrible, flawed, tough to figure out) comparisons. Maybe the first one can be sorted largest->smallest, then the next can be sorted in the same way as the first.
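A rough sketch of that idea, assuming each market is a simple mapping from category to sales count: take the category order from the first market, pin one color to each category, and reuse both for every later chart. The helper names, the counts, and the hex colors are all placeholders.

```python
def order_and_colors(first_market, palette):
    """Sort categories by the first market's sales and give each a fixed color."""
    order = sorted(first_market, key=first_market.get, reverse=True)
    colors = {
        category: palette[i % len(palette)] for i, category in enumerate(order)
    }
    return order, colors


def values_in_order(market, order):
    """Return one market's sales in the shared order (0 if a category is missing)."""
    return [market.get(category, 0) for category in order]


# Made-up counts and arbitrary hex colors, for illustration only.
market_one = {"Cannabis": 50, "Stimulants": 30, "Opioids": 20}
market_two = {"Opioids": 40, "Cannabis": 10}
palette = ["#1b9e77", "#d95f02", "#7570b3"]

order, colors = order_and_colors(market_one, palette)
second_values = values_in_order(market_two, order)
```

Here Cannabis keeps the same color and position in both charts even though it is no longer the largest slice in the second market. Categories that appear only in later markets are not handled in this sketch.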

If you just want to throw another chart in there for kicks, I'd say go for a slope graph (or parallel coordinates graph) to allow us to make comparisons between the amount of each drug sold on each market:

image

I probably just want to see more of this color scheme, to be honest.
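For the slope graph, the data would need one value per category per market on a shared scale. One option, sketched here as an assumption rather than a recommendation from the thread, is each category's share of that market's total sales. The market names, counts, and helper names below are invented.

```python
def shares(market):
    """Convert raw sales counts into each category's share of the market total.

    Assumes the market has at least one sale, so the total is not zero.
    """
    total = sum(market.values())
    return {category: count / total for category, count in market.items()}


def slope_lines(markets):
    """Return {category: [share in market 1, share in market 2, ...]}.

    `markets` is a list of (name, {category: sales}) pairs; a category that a
    market does not list gets a share of 0.0 there.
    """
    per_market = [shares(sales) for _name, sales in markets]
    categories = sorted({c for market_shares in per_market for c in market_shares})
    return {c: [s.get(c, 0.0) for s in per_market] for c in categories}


# Made-up counts, for illustration only.
markets = [
    ("Market A", {"Cannabis": 50, "Stimulants": 30, "Opioids": 20}),
    ("Market B", {"Cannabis": 10, "Opioids": 40}),
]
lines = slope_lines(markets)
```

Each list in `lines` is one line on the slope graph, with one point per market from left to right.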

angelareplica commented 6 years ago

Final

Project visuals/text

I haven't finished incorporating the feedback above, but I plan to! (Thank you, Sarah & Soma!)

alphabay_sales_edited2

evolution_sales_edited

silkroad2_sales_edited

The above 3 markets combined: dnm_chart_background

Details

Headline: What The Internet's Illicit Drug Trade Looks Like After Silk Road

Published website version: https://angelareplica.github.io/ds-dnms/

Code repository: https://github.com/angelareplica/data-studio/tree/master/code/05-dnm-drugs

Final data set(s): https://github.com/angelareplica/data-studio/tree/master/code/05-dnm-drugs

What did you find to be the most difficult part of this project?

Dealing with the limitations and anonymization of the CMU datasets.

Are you satisfied with what you produced? Is there anything you would like to change or improve?

Not yet. I need to improve the website, write the copy, and take into consideration the excellent feedback above. Once I figure out how to apply the slope/parallel coordinates graph to my data, I'll include that!
