nwihardjo / Java-Webscrapper

Multi-threaded Amazon and Craigslist web scraper
https://nwihardjo.github.io/Java-Webscrapper/

COMP3111: Software Engineering Project - Webscrapper

Group Name: #35-SHIBE

Member            Task     Task
albertparedandan  Basic 4  Basic 6
hanifdean         Basic 1  Basic 5
nwihardjo         Basic 2  Basic 3

Assumptions

  1. amazon.com is used as the additional reselling portal.
  2. Prices are shown in USD. A price of 0 is used when no price information is available.
  3. The posted date is shown in HKT (Hong Kong Time). (TODO: check what is shown as the posted date when it is null)
  4. Pagination of the Amazon portal is not handled.
  5. Keywords that return a whole new sub-section on Amazon (e.g. "book") are not handled: such keywords are not specific enough, so the page returned is not solely a list of available items. The scraper returns no results in this case.
  6. Items listed without a title / name are not scraped, as they are not valid items.
  7. The main price of an Amazon item is used, not the 'more buying options' or 'offer' price (usually a cheaper price for the same item from a different seller). When the main price is a range between two prices (usually due to different sizes, colours, etc.), the average of the two is used. When no main price is available, the cheapest 'more buying options' or 'offer' price is used as a rough estimate of the item's price.
  8. The posted date for an Amazon item is the date on which the item was first posted.
  9. Service listings on the Amazon portal (i.e. listings that are not items) are handled as well.
  10. If results are found but all prices are 0, the average selling price and lowest selling price are displayed as 0.0 rather than "-". "-" is displayed only when no results are found.
  11. Functions without access modifiers are deliberately package-private so that they can be unit tested.
  12. Because Craigslist pages are scraped concurrently, the console prints only "[int] page(s) of craigslist are being scraped in parallel ..." rather than the number of pages scraped so far.
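
The pricing and display rules in assumptions 2, 7 and 10 can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class `PriceRules`, its method names, and the idea of passing offer prices as a list are all hypothetical, and whether zero-priced items count towards the average is an assumption made here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names and signatures are illustrative, not taken from the project.
class PriceRules {
    private static final Pattern AMOUNT = Pattern.compile("\\d+(?:\\.\\d+)?");

    // Extracts every numeric amount from a label such as "$10.00 - $20.00".
    static List<Double> amounts(String label) {
        List<Double> found = new ArrayList<>();
        if (label == null) {
            return found;
        }
        Matcher m = AMOUNT.matcher(label.replace(",", "")); // drop thousands separators
        while (m.find()) {
            found.add(Double.parseDouble(m.group()));
        }
        return found;
    }

    // Main price wins; a range is averaged; otherwise the cheapest offer; otherwise 0.
    static double resolvePrice(String mainLabel, List<Double> offerPrices) {
        List<Double> main = amounts(mainLabel);
        if (!main.isEmpty()) {
            // For a single price, first and last are the same value.
            return (main.get(0) + main.get(main.size() - 1)) / 2.0;
        }
        double cheapest = 0.0; // assumption 2: 0 when nothing is known
        boolean seen = false;
        for (double offer : offerPrices) {
            if (!seen || offer < cheapest) {
                cheapest = offer;
                seen = true;
            }
        }
        return cheapest;
    }

    // Assumption 10: "-" only when there are no results; otherwise the numeric average.
    // Assumed here: zero-priced items are included in the average.
    static String displayAverage(List<Double> prices) {
        if (prices.isEmpty()) {
            return "-";
        }
        double sum = 0.0;
        for (double p : prices) {
            sum += p;
        }
        return String.valueOf(sum / prices.size());
    }
}
```

With these rules, a main price label of "$10.00 - $20.00" resolves to 15.0, and a result list whose prices are all 0 is displayed as 0.0 instead of "-".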

TL;DR

A web scraper that searches both Amazon and New York Craigslist for a specified keyword. Multi-threading is used to scrape Craigslist result pages and to retrieve the posted dates of Amazon items concurrently, which significantly improves performance.
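
As a rough illustration of the concurrent pagination described above, the sketch below submits one task per result page to a thread pool and collects the results in page order. It is only a sketch under assumptions: `ParallelPages`, `scrapeAll`, the `pageFetcher` parameter and the pool size cap of 8 are hypothetical and stand in for the project's real scraping logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntFunction;

// Hypothetical sketch: pageFetcher stands in for the real per-page scraping code.
class ParallelPages {
    static List<String> scrapeAll(int pageCount, IntFunction<List<String>> pageFetcher) {
        List<String> items = new ArrayList<>();
        if (pageCount <= 0) {
            return items;
        }
        // Pool size cap of 8 is an arbitrary choice for this example.
        ExecutorService pool = Executors.newFixedThreadPool(Math.min(pageCount, 8));
        try {
            System.out.println(pageCount + " page(s) of craigslist are being scraped in parallel ...");
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int page = 0; page < pageCount; page++) {
                final int p = page;
                Callable<List<String>> task = () -> pageFetcher.apply(p);
                futures.add(pool.submit(task)); // all pages start without waiting for each other
            }
            for (Future<List<String>> future : futures) {
                items.addAll(future.get()); // blocks until that page is done; keeps page order
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
        return items;
    }
}
```

Because all pages are in flight at once, there is no meaningful "pages scraped so far" counter to print, which is why only a single progress line is emitted.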


Dependencies

  1. Java 8 JDK with Gradle
  2. JavaFX as the GUI framework
  3. JUnit 4.12 as the testing suite
  4. JaCoCo for test coverage measurement

Running the programme

The project is configured with Gradle, a Makefile-like build tool that streamlines compilation.

Compile with Windows Command Prompt

If you want to just rerun the project without rebuilding it,

Compile with Mac/Linux terminal

If you want to just rerun the project without rebuilding it,

Unit tests and JaCoCo coverage report

Some of the unit tests use cached pages from both portals. The tests use reflection to exercise private methods (admittedly not good practice).
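
For readers unfamiliar with the technique, calling a private method through reflection looks roughly like the sketch below. The classes `TitleCleaner` and `ReflectionTestSketch` and the method `normalise` are made up for this example and are not part of the project; in a real JUnit 4 test the returned value would be checked with `assertEquals`.

```java
import java.lang.reflect.Method;

// Hypothetical class under test with a private static helper.
class TitleCleaner {
    private static String normalise(String raw) {
        return raw == null ? "" : raw.trim().replaceAll("\\s+", " ");
    }
}

// Hypothetical test utility: looks up a private static method and invokes it.
class ReflectionTestSketch {
    static Object invokePrivateStatic(Class<?> owner, String name,
                                      Class<?>[] paramTypes, Object... args) {
        try {
            Method method = owner.getDeclaredMethod(name, paramTypes);
            method.setAccessible(true);       // bypass the private modifier
            return method.invoke(null, args); // null receiver because the method is static
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Object result = invokePrivateStatic(
                TitleCleaner.class, "normalise", new Class<?>[] {String.class}, "  a   b ");
        System.out.println(result);
    }
}
```

The drawback is that such tests refer to methods by name as strings, so a rename breaks them only at run time; making the methods package-private avoids this.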

Documentation / javadoc

The latest javadoc is available here. Alternatively, if you prefer to compile it yourself,