jsoma / data-studio-projects


[Project] MTA Lost and Found #176

Open Weihua4455 opened 6 years ago

Weihua4455 commented 6 years ago

Pitch

Have you ever lost anything on MTA and never saw it again, ever?

If that's the case, you are not the only one. The MTA maintains a database of all the lost items they found on the subway, waiting sadly to be reclaimed by their owners. Apparently they update the list every hour, and I'm curious about the most common things that people leave on the subway.

Full disclosure: FiveThirtyEight did the exact same story in 2014, and Tim Wallace from the New York Times was the one who found the data.

But hey, it's been four years. Plus, there is nothing wrong with standing on the giants' shoulders.

Summary

F.Y.I. If you did lose something recently, here is where you can file an inquiry:

http://lostfound.mtanyct.info/lostfound/LPURiderOnlineInquiryEntry.aspx

Details

Possible headline(s):

"Top 10 Things People Left on MTA"
"10 Things New Yorkers Leave on MTA ALL THE TIME"
"Lost and Never Found" (which is not entirely accurate)

Data set(s):

http://advisory.mtanyct.info/LPUWebServices/CurrentLostProperty.aspx

Code repository:

https://github.com/Weihua4455/data_studio/tree/master/02_mta_lost_n_found

Possible problems/fears/questions:

Fear: I will never make it as good as FiveThirtyEight ...?

Also, I want to find a new angle to the story. For example, can I track down some people who lost stuff on MTA and never found it? Maybe through scraping social media? Can I actually help them find it? Or add a human element to the story some other way?

During my research, I came across a system that MTA uses to auction off some of the lost items that are past their retention period. I wonder 1) How much money do they make from it? 2) Can I get THAT data?

Work so far

First we scrape. (Or I do.)

The lost & found database that MTA maintains is in XML format, like so:

xml

Knowing nothing about XML, I spent a lot of time searching through StackOverflow and reading the Beautiful Soup documentation. Turns out all I needed to do was pip install lxml. Obviously.
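For anyone following along, here is a minimal sketch of the parsing step. It uses Python's built-in xml.etree.ElementTree instead of Beautiful Soup with lxml, and the sample XML (tag names, attribute names, counts, and the "Home Furnishings" category) is a made-up stand-in, not the real layout of the MTA feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the element and attribute names are assumptions,
# not the real structure of the MTA feed.
SAMPLE_XML = """
<LostProperty>
  <Category name="Electronics">
    <Item name="Cell phone" count="3"/>
    <Item name="Tablet" count="1"/>
  </Category>
  <Category name="Home Furnishings">
    <Item name="Wall and Window Covering" count="2"/>
  </Category>
</LostProperty>
"""


def parse_items(xml_text):
    """Flatten category/item elements into a list of row dicts."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for category in root.findall("Category"):
        for item in category.findall("Item"):
            rows.append({
                "category": category.get("name"),
                "item": item.get("name"),
                "count": int(item.get("count")),
            })
    return rows


rows = parse_items(SAMPLE_XML)
for row in rows:
    print(row)
```

A list of dicts like this can be handed to pandas.DataFrame(rows) to get a table with one row per item.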

Then the rest is easy. My dataframe looks like this:

df

Just ... How does someone lose "Wall and Window Covering" on the subway?

Anyway, since MTA is supposed to update this database every hour, I thought it might be worth running the scraping code on a server every hour and saving each dataframe with a timestamp in the filename.

filename
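A rough sketch of the timestamped-save idea, using only the standard library. The filename pattern, the column names, and the example date are assumptions for illustration, and running it every hour would be left to something like a cron job on the server.

```python
import csv
import tempfile
from datetime import datetime
from pathlib import Path


def snapshot_path(folder, when):
    """Build a filename such as lost_found_2018-07-01_14-00.csv."""
    return Path(folder) / f"lost_found_{when:%Y-%m-%d_%H-%M}.csv"


def save_snapshot(rows, folder, when=None):
    """Write one scrape to its own timestamped CSV and return the path."""
    when = when or datetime.now()
    path = snapshot_path(folder, when)
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["item", "count"])
        writer.writeheader()
        writer.writerows(rows)
    return path


# Placeholder rows; a real run would pass in the freshly scraped data.
rows = [{"item": "Cell phone", "count": 3}]
with tempfile.TemporaryDirectory() as folder:
    saved = save_snapshot(rows, folder, when=datetime(2018, 7, 1, 14, 0))
    saved_name = saved.name
    saved_text = saved.read_text()
print(saved_name)
```

Putting the date before the hour in the filename means the files sort chronologically when listed alphabetically.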

I did a test run on the server, and everything seems to work smoothly. Still, there is always a chance that the server is going to scream at me at some point with error messages. I'll let it run for a day to see if changes in the data are significant enough.

That said, I plotted the first dataset I scraped. Here are the ten things that New Yorkers lose on MTA all the time:

10

Just imagine this: 50,000 cell phones are sitting underground, in the MTA office at the 34th St. Station, collecting dust.

And! Please do let me know if I'm writing too much.

Checklist

jsoma commented 6 years ago

Fear: I will never make it as good as FiveThirtyEight ...?

That's 100% guaranteed! The worst thing you can ever do is look at anything anyone else has done on the same topic.

You can also use something like this tool (probably) to convert your XML to JSON, which you could then load with data = json.loads(varname). XML is the worst.
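If a converter isn't handy, a rough standard-library version of the same idea might look like the sketch below. The XML sample is invented (the real feed's tag and attribute names may differ), and element_to_dict is a hypothetical helper, not part of any library.

```python
import json
import xml.etree.ElementTree as ET

# Invented sample; the real feed's tag and attribute names may differ.
SAMPLE_XML = (
    '<LostProperty>'
    '<Item name="Cell phone" count="3"/>'
    '<Item name="Wallet" count="2"/>'
    '</LostProperty>'
)


def element_to_dict(element):
    """Recursively turn an element into plain dicts and lists."""
    return {
        "tag": element.tag,
        "attributes": dict(element.attrib),
        "children": [element_to_dict(child) for child in element],
    }


# Serialize to a JSON string, then load it back the way the comment suggests.
varname = json.dumps(element_to_dict(ET.fromstring(SAMPLE_XML)))
data = json.loads(varname)
print(data["children"][0]["attributes"]["name"])
```

Note that attribute values come back as strings, so counts would still need an int() conversion before any arithmetic.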

Just imagine this: 50,000 cell phones are sitting underground, in the MTA office at the 34th St. Station, collecting dust.

How can we turn that into a graphic? A long bar isn't going to be nearly as exciting. How can we make people relate to it, since it's soooo close to our everyday experience? How frequently do people have to be losing phones for us to end up in this situation?

playfairbot commented 6 years ago

Greetings! I'm a little robot, beep beep boop boop.

Please post your first revision! It should be posted by Thursday at midnight. More details available here.

You need some feedback, let me summon @ElinaMak, @adrianblanco, @Yuanqi-Hong for you

Yuanqi-Hong commented 6 years ago

It's funny how the things people leave behind are the ones that are most important. Maybe instead of standing on FiveThirtyEight's shoulders you should stand on their head, i.e. change your plot from bar to barh, since it's gonna be a long list.

adrianblanco commented 6 years ago

You are pretty limited because the data is not very detailed. What if you try to show the total value of all these items per category? Then you'll show a different perspective on the story. I checked whether the data says where the items were found, but it looks like there is no way to know that... Sorry for not being very helpful.

Weihua4455 commented 6 years ago

Update

Your project content: images/words/etc

This is where I cry.

So after the last draft, I realized the limitations of my data, so I tried to be creative and make the limited data interesting.

1) First I scraped MTA's data for a week, hoping that the numbers would change (it's supposed to be updated hourly) and that the pattern would tell an interesting story. Guess what? MTA didn't update their data, at all. I went through 168 CSVs (because 7 days * 24 hours/day = 168) and found that the numbers DID NOT CHANGE.

2) Then I thought maybe it would be interesting to compare lost & found data among major U.S. cities, you know, Boston's MBTA, D.C.'s Metro, etc. But they don't keep similar databases online. When I called some metro system's lost & found office, I was told that they could give me the aggregate data ... in 2 weeks.

3) So I thought: what if I look for lost and found data from other transportation sectors in NYC? You know, the airports, taxis, Ubers, etc. That thought took me to strange places, like the Department of Homeland Security's TSA Claims Data, or this website where New Yorkers post what they lost in Yellow Cabs, hoping, desperately, that the driver will see it and return their lost items. After painstakingly scraping and cleaning these data, I came to the painful conclusion that the numbers from Yellow Cabs and airports are simply too small to be compared with MTA's data. Life is tough.

4) Finally, I started pulling my hair out, thinking about how I could work creatively with the data I have. Then it came to me (kind of): WHAT IF I estimate the length of each item, calculate the total length if I line them up side by side, and compare that with the length of MTA trains? I.e. if I line up all the wallets in MTA's lost and found department, they will be the same length as x MTA trains.

And that's what I did.

First, estimate length.

image

Then, create a dataframe.

image

Calculate total length and "train length"

image
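The arithmetic behind this step can be sketched as below. Every number is a placeholder rather than a value from the screenshots: the per-item lengths are guesses, the wallet count is invented, the phone count is the rough 50,000 figure mentioned earlier in the thread, and the train length assumes ten cars of roughly 60 feet each.

```python
# Every number here is a placeholder, not a value from the MTA data
# or from the dataframe in the screenshots.
TRAIN_LENGTH_FT = 600.0  # assumed: ten cars at roughly 60 feet each

items = [
    # (item, count, assumed length of one item in inches)
    ("Cell phone", 50000, 6.0),
    ("Wallet", 1000, 4.0),
]


def train_lengths(count, inches_each, train_ft=TRAIN_LENGTH_FT):
    """Length of `count` items lined up in a row, measured in trains."""
    total_ft = count * inches_each / 12  # inches to feet
    return total_ft / train_ft


results = {name: train_lengths(count, inches) for name, count, inches in items}
for name, trains in results.items():
    print(f"{name}: {trains:.1f} trains")
```

Keeping the train length as a single constant makes it easy to swap in a different car count or car length and see how much the comparison moves.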

Graph! (They are MTA colors because MTA is fun!)

image

I want to clean it up in Illustrator, then, idk, maybe I will do something corny, like replacing the bars with MTA trains? Something like this:

image

Don't judge.

Any changes in direction or topic?

Oh...yes.

Problems/Questions

1) Can I label each bar with the amount of items (which is also in my dataframe)? Here is my dataframe:

image

2) Any suggestions for how I can make this more interesting? I was thinking of graphing this on a real MTA map, like replacing a certain segment of track with a bunch of phone emojis (or not), and writing something like "if you line up all the phones, it's the same distance as between Columbus Circle and Times Square."
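For question 1, one possible approach is matplotlib's Axes.bar_label (available in matplotlib 3.4 and later), which writes a label next to each bar. This is only a sketch: the names and numbers are placeholders rather than the real dataframe, and make_labels and plot_with_labels are hypothetical helper names.

```python
def make_labels(counts):
    """Format each count with thousands separators, e.g. 50000 -> '50,000'."""
    return [f"{count:,}" for count in counts]


def plot_with_labels(names, lengths, counts):
    """Draw horizontal bars and write the item count next to each one."""
    import matplotlib.pyplot as plt  # imported here so make_labels works without it

    fig, ax = plt.subplots()
    bars = ax.barh(names, lengths)
    # One label per bar, taken from the counts rather than the bar lengths.
    ax.bar_label(bars, labels=make_labels(counts), padding=3)
    ax.set_xlabel("Length in trains")
    return fig


# Placeholder values, not the real numbers from the dataframe.
names = ["Cell phone", "Wallet"]
lengths = [41.7, 0.6]
counts = [50000, 1000]
labels = make_labels(counts)
print(labels)
```

Calling plot_with_labels(names, lengths, counts).savefig("bars.png") would then write the labeled chart to a file.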

Checklist

vpenney commented 6 years ago

Nice topic, Weihua! This one is tough because the dataset only has counts of items, but no information on how long MTA has been hanging on to those things or anything else that you can sort by. You basically have to do counts.

One thing that FiveThirtyEight did that could help (if you have time) is they reached out to the good people of the MTA Lost and Found. I wonder if you can just call them up and ask them how long things tend to stay there, or if they have any annual reports on their lost and found inventory. Like, are there a ton of Motorola Razr phones from ten years ago hanging out down there?

You could also take a look at subsets of data, such as the most popular musical instruments that are left on subways, or maybe take a look at the ten largest items commonly left on subways (TVs are baffling to me).

I think that you're on the right track (no pun intended) with looking at the length of objects. Adding outside information that you know about the items helps add a new spin on the story. Value of the lost items would also be interesting to look at, but there's no way of knowing how many of those phones are really old and worth practically nothing vs. how many are relatively new. I guess you could go by retail value?

Katerinavts commented 6 years ago

Great topic, limited data! Yes, ideally you could do some traditional reporting and add more information from interviews with MTA employees. Sometimes I wonder if we can add some 'offline' datasets to our projects (i.e. datasets we create from interviews). I think you should go for it.

Some questions that come to mind as a reader of your pitch:

-- Where do people most often report that they think they lost an item (is it Times Square etc?)
-- What is the average time of retrieval of a found item?
-- In how many stations can I return a lost item?

Following up on the recommendation above, adding retail value to items is a good idea for another graph or just annotation.