redpanda-ai / Meerkat

Used for the Meerkat project

Geomancer Sub-project #822

Open redpanda-ai opened 8 years ago

redpanda-ai commented 8 years ago

Definition of Success

Here is a table showing our target accuracy metrics for our two _addressable markets_, which assume that the merchant name is already known:

| Metric | City + State | City + State + Zip |
|---|---|---|
| Precision | 99% | 99% |
| Recall | 90% | 80% |
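As a rough illustration of how a batch of predictions could be checked against these targets, here is a minimal sketch. The function names, the `TARGETS` keys, and the convention that a `None` prediction means the model declined to answer are assumptions made for this example; they are not taken from Meerkat.

```python
from typing import Optional, Sequence, Tuple

# Targets from the table above; the dictionary keys are hypothetical names.
TARGETS = {
    "city_state": {"precision": 0.99, "recall": 0.90},
    "city_state_zip": {"precision": 0.99, "recall": 0.80},
}


def precision_recall(
    predicted: Sequence[Optional[str]], truth: Sequence[str]
) -> Tuple[float, float]:
    """Return (precision, recall), treating None as "no prediction made"."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    # Precision is measured over the transactions the model answered;
    # recall is measured over every transaction in the ground truth.
    answered = sum(1 for p in predicted if p is not None)
    correct = sum(1 for p, t in zip(predicted, truth) if p is not None and p == t)
    precision = correct / answered if answered else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


def meets_targets(
    predicted: Sequence[Optional[str]], truth: Sequence[str], market: str
) -> bool:
    """Check one addressable market against its precision and recall targets."""
    precision, recall = precision_recall(predicted, truth)
    target = TARGETS[market]
    return precision >= target["precision"] and recall >= target["recall"]
```

Under this convention a model can trade recall for precision by abstaining on uncertain transactions, which is consistent with the recall target being lower than the precision target.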

What we need for ground truth

vnagarajY commented 8 years ago

Location of the ground truth data: https://console.aws.amazon.com/s3/home?region=us-west-2&bucket=s3yodlee&prefix=meerkat/cnn/data/merchant/

Use the latest folder (by date) under bank/card.

vnagarajY commented 8 years ago

Start off with Starbucks; I will get a list of 20 soon.

vnagarajY commented 8 years ago

Ace Hardware, Walmart, Walgreens, Target, Subway, Starbucks, Safeway, McDonald's, Costco, Burger King, Bed Bath & Beyond, Aeropostale, Albertsons, American Eagle Outfitters, Applebee's, Arby's, AutoZone, Bahama Breeze, Barnes & Noble, Baskin-Robbins, Bealls, Eddie V's, Fedex, Five Guys, Food 4 Less, Francesca's, Fred Meyer, Gymboree, H&M, Home Depot, IHOP, In-N-Out Burger, J. C. Penney, KFC, Kmart, Kohl's, LongHorn Steakhouse, Lowe's, Macy's, Nordstrom

Initial Target Merchant list

redpanda-ai commented 8 years ago

Okay, each of these is located in our most recent merchant label_map.json. Once our sample is ready, we'll need to run the sample through Meerkat's Merchant CNN so that we can get some proper analysis and Pybossa labeling tasks.

diwu001 commented 8 years ago

Top merchants without store address: Nordstrom, Francesca’s, H&M, Ace Hardware, LongHorn Steakhouse, American Eagle Outfitters, Albertsons, AutoZone, J. C. Penney, Kmart, Aeropostale, Applebee’s, Baskin-Robbins, Subway, Barnes & Noble, Five Guys, Bealls, Macy’s, Lowe’s, Eddie V’s, In-N-Out Burger, Fedex, Home Depot

Top merchants with store address: McDonald’s, KFC, Bahama Breeze, Costco Wholesale Corp., IHOP, Fred Meyer, Walmart, Starbucks, Kohl’s, Walgreens, Target, Burger King, Bed Bath and Beyond, Food 4 Less, Arby’s, Gymboree

vnagarajY commented 8 years ago

Updated file to fix data missing issue: https://s3-us-west-2.amazonaws.com/s3yodlee/meerkat/AggData/2016-08-16-top-merchants-store+(1).zip

Is there any way we can add the ones from these ten that are not already in the list? We got the request from a specific prospect. A few of them may already be in the 40.

AUTOZONE, BEST BUY CO INC, DOMINO'S PIZZA, HOME DEPOT, LULULEMON, O'REILLY AUTOMOTIVE INC, PAPA JOHNS, SALLY BEAUTY HOLDINGS INC, TARGET CORPORATION, ULTA SALON COSMETICS & FRAGRANCE INC

vnagarajY commented 8 years ago

Updated files: https://s3-us-west-2.amazonaws.com/s3yodlee/meerkat/AggData/top40merchantsone+perfile.zip

redpanda-ai commented 7 years ago

Notes

These notes will help us get back on track once we have solved the immediate concern with building a solution for Voldemort.

We built:

  1. automation to grab agg data from S3
  2. a script to pull simple random samples from the Hadoop data warehouse using Pig
  3. a method to deduplicate transaction descriptions from our samples.
  4. a template in Pybossa which allows us to collect Ground Truth Labels.
  5. the means to automatically generate new Pybossa projects and add new tasks
  6. a way to retrieve task runs from a Pybossa project
  7. an architecture to allow us to test new models for accuracy and produce classification reports plus show us transactions that were mis-labeled.
  8. an updated method that uses a trie plus a dictionary to reach an F1 score above 98% for Target card, based on over 600 unanimously and independently labeled samples.
  9. a way to use the Merchant CNN and pandas to divide the SRS sample into separate per-merchant dataframes.
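The notes do not say how the trie and the dictionary in item 8 are combined, so the following is only one possible arrangement, sketched under stated assumptions: the trie holds known city names and finds the longest one that appears in a transaction description, and the dictionary maps each city to a state. The `Trie` class, `find_city_state`, and the sample data are hypothetical and are not the Geomancer implementation.

```python
from typing import Dict, Optional, Tuple


class Trie:
    """Character trie that supports longest-prefix lookup."""

    _END = "$end"  # multi-character marker, so it cannot collide with a single char

    def __init__(self) -> None:
        self.root: dict = {}

    def insert(self, word: str) -> None:
        node = self.root
        for char in word:
            node = node.setdefault(char, {})
        node[self._END] = True

    def longest_prefix(self, text: str, start: int = 0) -> Optional[str]:
        """Return the longest inserted word that begins at text[start], if any."""
        node = self.root
        best = None
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break
            if self._END in node:
                best = text[start:end + 1]
        return best


def find_city_state(
    description: str, trie: Trie, city_to_state: Dict[str, str]
) -> Optional[Tuple[str, str]]:
    """Find the longest known city name in a description and look up its state."""
    text = description.upper()
    best = None
    for start in range(len(text)):
        # Only try matches that begin at the start of a word.
        if start > 0 and text[start - 1].isalnum():
            continue
        match = trie.longest_prefix(text, start)
        if match and (best is None or len(match) > len(best)):
            best = match
    if best is None:
        return None
    return best, city_to_state[best]


# Hypothetical sample data for illustration only.
CITY_TO_STATE = {"SAN JOSE": "CA", "SAN FRANCISCO": "CA", "PALO ALTO": "CA"}
CITY_TRIE = Trie()
for city in CITY_TO_STATE:
    CITY_TRIE.insert(city)
```

For example, `find_city_state("TARGET T-1234 san jose ca", CITY_TRIE, CITY_TO_STATE)` scans the upper-cased description and matches `SAN JOSE`. This sketch does not check that a match ends on a word boundary, and a real city dictionary would need to handle cities that exist in more than one state.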

We wish to:

  1. Build a framework to automatically ingest updates to the ground truth and re-run all the models which should be re-run
  2. Present an up-to-date dashboard showing predictive qualities for each model over time.
  3. Expand auto-load to include models produced in this fashion
  4. Regularly export the ground truth to S3 (we'll need to draft a design for the S3 paths)
  5. Develop a schedule for the labeling associates so that they can produce at least 1000 unanimously independently labeled GT for each combination of top merchant (see list above) x (bank or card)
  6. Add a mechanism to automatically update our Agg Data
  7. Introduce others to this system so that Pybossa templates and new models that use our Geomancer class become the standard for all model development.
  8. Set up a demonstration in front of a larger audience, including our business associates and VP-level executives.
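To make the labeling goal in item 5 concrete, here is a minimal sketch of how unanimous labels could be extracted from task runs and compared with the 1000-per-combination target. The record layout (dictionaries with `task_id` and `label` keys), the `task_groups` mapping, and the minimum of two labelers per task are assumptions for illustration; real Pybossa task runs would need to be converted into this shape first.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Mapping, Tuple

MIN_LABELERS = 2         # assumed minimum number of labelers per task
TARGET_PER_GROUP = 1000  # from the goal stated in item 5

Group = Tuple[str, str]  # (merchant, "bank" or "card")


def unanimous_labels(
    task_runs: Iterable[Mapping], min_labelers: int = MIN_LABELERS
) -> Dict[Hashable, str]:
    """Map task_id to label for tasks where every labeler gave the same answer."""
    answers = defaultdict(list)
    for run in task_runs:
        answers[run["task_id"]].append(run["label"])
    return {
        task_id: labels[0]
        for task_id, labels in answers.items()
        if len(labels) >= min_labelers and len(set(labels)) == 1
    }


def remaining_by_group(
    unanimous: Mapping[Hashable, str],
    task_groups: Mapping[Hashable, Group],
    target: int = TARGET_PER_GROUP,
) -> Dict[Group, int]:
    """Return how many more unanimous labels each group still needs."""
    counts = Counter(task_groups[task_id] for task_id in unanimous)
    return {
        group: max(0, target - counts[group])
        for group in set(task_groups.values())
    }
```

This version counts answers rather than distinct labelers, so a stricter check of independence would also need a labeler identifier on each task run.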

Shortcomings:

  1. Pylint violations
  2. A total of 0 unit tests