philipmulcahy / azad

amazon order history reporter chrome extension
Apache License 2.0

Order/item category extraction #169

Closed iloveitaly closed 7 months ago

iloveitaly commented 2 years ago

I'd love to be able to see which category an item was in, or most of the order was in.

Could you point me to the best place in the code to add this? I may be able to hack this in and submit a PR.

philipmulcahy commented 2 years ago

I assume you're going after the category path that's called "breadcrumbs" in the HTML. Have you thought about whether you want to display just the leaf node of that list, or the whole list, in the item cell? For example, "Sports›Table Tennis›Equipment Bags" or just "Equipment Bags"? (I reckon the latter is unviable.)
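To make the two display options concrete, here is a minimal sketch. The names `CategoryDisplay` and `formatCategory` are hypothetical and are not taken from the azad code base; the sketch only assumes the breadcrumbs have already been extracted as a list of strings.

```typescript
// Hypothetical helper: illustrates "whole path" versus "leaf only" display.
// Not part of azad; names are made up for this sketch.
type CategoryDisplay = 'full' | 'leaf';

function formatCategory(breadcrumbs: string[], mode: CategoryDisplay): string {
  // Drop whitespace-only entries that scraping tends to leave behind.
  const cleaned = breadcrumbs
    .map(part => part.trim())
    .filter(part => part.length > 0);
  if (cleaned.length === 0) {
    return '';
  }
  // 'leaf' keeps only the most specific category; 'full' joins the whole path.
  return mode === 'leaf' ? cleaned[cleaned.length - 1] : cleaned.join('›');
}
```

With the example above, `formatCategory(['Sports', 'Table Tennis', 'Equipment Bags'], 'full')` gives the whole path, and the `'leaf'` mode gives just "Equipment Bags".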

From memory, I think there are going to be at least two files you'll need to touch - item.ts and table.ts - one for adding the extraction code and the other for getting it to display.

If/when you send a PR, I have a (secret, to protect user identifiable info) test pack of data users from various countries have shared with me, and I can add a couple of test cases to it based on that existing data.

Yours,

Philip

iloveitaly commented 2 years ago

@philipmulcahy thank you for the reply!

My thought is to display the whole category breadcrumb path. That's great to know about the secret pack of test data. Thanks!

Could you help me with one more detail: how do you run the scraper against a single order page to test new logic? What's the best development loop for testing extraction of new data from an amz page?

Thanks!

iloveitaly commented 2 years ago

@philipmulcahy I was able to get a debugging setup with a single order, so I'm good there.

It seems the only way to get the category is from the product page, so the easiest approach is to make another request to fetch that page and extract the category breadcrumb from it.
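As a rough sketch of the extraction step (not the code from the eventual PR): the selector below is an assumption about Amazon's product page markup and may differ by locale or change over time, and `extractBreadcrumbs` is a hypothetical name. The function is typed against a minimal interface so it can be exercised without a browser DOM.

```typescript
// Minimal structural types so the sketch runs outside a browser.
// A real DOM Document also satisfies Queryable.
interface TextNode {
  textContent: string | null;
}
interface Queryable {
  querySelectorAll(selector: string): ArrayLike<TextNode>;
}

// ASSUMPTION: the breadcrumb links live under this element on product pages.
const BREADCRUMB_SELECTOR = '#wayfinding-breadcrumbs_feature_div a';

// Hypothetical helper: returns the breadcrumb labels in page order.
function extractBreadcrumbs(doc: Queryable): string[] {
  return Array.from(doc.querySelectorAll(BREADCRUMB_SELECTOR))
    // Collapse the newlines and padding that surround the link text.
    .map(node => (node.textContent ?? '').replace(/\s+/g, ' ').trim())
    .filter(text => text.length > 0);
}
```

In the extension, the fetched HTML string could be turned into a document with `new DOMParser().parseFromString(html, 'text/html')` and passed straight in.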

https://github.com/philipmulcahy/azad/blob/master/src/js/item.ts#L77

Two questions for you:

  1. What's the best way to access the scheduler from that point in the code?
  2. Any reason why I can't use await in the strategy* methods instead of a promise callback?

philipmulcahy commented 2 years ago

Hi @iloveitaly,

1) We could pass the scheduler from order.ts, which would mean adding a param to ItemsExtractor and then params to the various implementing strategies. Can you think of another strategy? One possible concern here is performance: more fetches == slower and more likely to trigger anti-crawling protection by Amazon, so that users who are currently getting away with their scrapes start getting blocked. For this reason, I think we should put the additional behaviour behind a feature flag controlled by the UI (another checkbox). The converted fetches should be cached in the azad cache.

2) I don't think so. Some history: I stopped being a full-time coder (sigh) around the time MS dropped the await demo in C# - a good few years ago. This means I have never properly used await in any language except maybe a bit in python. This extension was originally written in schoolboy js (it was the project I used to learn the language), and was then transmuted to ts because the code base grew enough that not having static types was becoming a bit of a drag. While I was doing this, various async features were added to the language and I got a tiny bit more experienced, but I never invested in a clean-up refactor. I think that promises are probably required for some of the code (where it assembles arrays of promises and waits for them all before continuing), but if you think await works more cleanly for what you're trying to do, I'd be keen to see your approach.
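The two points above can be sketched together. This is only an illustration under stated assumptions: every name here (`CategoryOptions`, `categoryFor`, `categoriesFor`) is hypothetical, the `Map` stands in for the azad cache, and `fetchCategory` stands in for a fetch routed through the scheduler. It also shows that `await` and "wait for an array of promises" are compatible, since `Promise.all` can itself be awaited.

```typescript
// All names are hypothetical; none of this is azad's real API.
interface CategoryOptions {
  enabled: boolean;                                 // feature flag, e.g. set by a UI checkbox
  cache: Map<string, string>;                       // stand-in for a persistent cache
  fetchCategory: (url: string) => Promise<string>;  // stand-in for a scheduled fetch
}

async function categoryFor(url: string, opts: CategoryOptions): Promise<string> {
  // With the flag off, no extra request is ever made.
  if (!opts.enabled) {
    return '';
  }
  const cached = opts.cache.get(url);
  if (cached !== undefined) {
    return cached;
  }
  // await replaces a .then() callback here.
  // Note: concurrent calls for the same URL may each fetch once,
  // because the cache is only filled after the fetch resolves.
  const category = await opts.fetchCategory(url);
  opts.cache.set(url, category);
  return category;
}

async function categoriesFor(urls: string[], opts: CategoryOptions): Promise<string[]> {
  // Assemble an array of promises, then wait for all of them.
  return await Promise.all(urls.map(url => categoryFor(url, opts)));
}
```

Keeping the flag check inside the helper means callers do not need to know whether category fetching is switched on; they just get empty strings when it is off.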

Yours,

Philip

iloveitaly commented 2 years ago

@philipmulcahy I implemented the category scraping in this PR https://github.com/philipmulcahy/azad/pull/175. Agree with you on a UI option, but I don't have the time. Hopefully someone can take what I did and improve it.

philipmulcahy commented 8 months ago

d6cf6ac (heavily updated version of one of @iloveitaly's PRs, with tests and UI setting checkbox wired in) fixes this, and I just merged it into master - expect to see it in the next couple of weeks.

philipmulcahy commented 7 months ago

v.1.11.0 in review with the webstore: expect it to deploy in the next week