prmr / Creco

Recommendation System for Consumer Products
Apache License 2.0
6 stars 2 forks source link

Create the Data API #2

Closed enewe101 closed 10 years ago

enewe101 commented 10 years ago

Make an API for loading products into the following classes:

The classes in the product object graph should be immutable.

prmr commented 10 years ago

Basic demo with GSON in branch Issue0002

enewe101 commented 10 years ago

In the persistence branch I have a working version of a java representation of the CR database. To see how it works, checkout the branch then run CRDataExample, which is in the ca.mcgill.cs.creco.persistence package.

Users of the package should mainly interact with an instance of the CRData class, and the CategoryList and ProductsList that are returned from crData.getCategoryList() and crData.getProductsList(). The main usecases that I anticipated are shown off in the example.

At the moment, this CRData representation enhances the raw CR data in several ways

I'm not 100% sure if this committed properly -- git seems to work differently inside Eclipse. If anyone is able to pull and run, please let do me know! At the moment I cannot figure out how to add files nor to commit, but I believe I have found a funny trick to force Eclipse to do so. Also, merging causes my files to revert (obviously I'm using this incorrectly).

prmr commented 10 years ago

Moved to the data package. We should have a data.loader package for the JSON classes, and make the stub objects package-private.

prmr commented 10 years ago

Why was this issue closed?

asutcl commented 10 years ago

Sorry I clicked the wrong button, I was trying to write a comment. Which was: Attributes should probably have a field instructing what kind of value they hold instead of having to check the class of the value every time you want to use that attribute.

prmr commented 10 years ago

See additional requests in Issue #15

enewe101 commented 10 years ago

I've merged Issue0002 into master, and I am merging this into the persistence branch. May I close the Issue2 branch, since it is now subsumed in persistence?

prmr commented 10 years ago

Sure, it was demo code anyways.

enewe101 commented 10 years ago

Data API implemented using stubs. Merged with master. (Finally !! :)

prmr commented 10 years ago

I'm wondering why the .gradle is back under source control. This generates uncommitted changes every time we run the build.

enewe101 commented 10 years ago

Removed again. I re-added them because I noticed that untracking them causes them to be deleted for anyone who pulls. But now I realize that it's not a problem -- they just get rewritten next build...

mangalagb commented 10 years ago

In the ca.mcgill.cs.creco.data.Product .java file, there is a tiny error. The method getName returns displayName but setName sets the name . Could you fix it in the master branch?

asutcl commented 10 years ago

Quick question: Do we know if having a valueMin and having a valueEnum for a Attribute is mutualy exclusive? Seems to be what the code is doing. Maybe I should run a quick test though, to make sure there aren't any errors in the CR database. I will start an issue on testing the data API and take ownership of it. Everyone let me know if they need some things tested for their part.

enewe101 commented 10 years ago

Gowri -- thanks for the heads up. I think the solution will be to kill the setter. That is leftover from my first implementation, and the intention now is to have the data object graph immutable.

(Some thoughts about names in general: as you may have gathered the mapping of getName() to the displayName is to provide consistency between products that have displayName and categories that have singularName. The name field of products seems to be an old db key that I judged to not be useful. I think there is a getter for it still if its desired, I'll verify).

enewe101 commented 10 years ago

Andrew -- good Q. In fact, they are not mutually exclusive. Often an attribute is numeric most of the time, but is sometimes a string having the semantics of N/A.

Example: see the "Claimed maximum run time" attribute (attributeId : "6882") of humidifiers (categoryId : "32968"). Most entries are numeric strings, which do produce a valueMin and valueMax, but some are "Data not available" which is caught by valueEnum.

Note this detail: numeric strings are understood as numeric and affect valueMin / valueMax

asutcl commented 10 years ago

Do you know if this also happens when values are real numbers or only when values are numeric strings? The use of numeric string might indicate that a certain field can only take a few discrete numeric values. We may want to treat them as enums in this case since they might not have linear scaling, much like a Likert scale.

If we treat them as a numerical value then we will have to handle these (using default scores maybe) and make sure it is clear to the user in the end that these values were missing. On the other hand if these are kept as strings then an N/A string could be handled more easily.

In the testing branch I will compare keeping numerical strings as strings as opposed to converting them to numbers. We should discuss this more but my suggestion is that if taking numerical strings as strings solves the issue then we should opt for that, since I assume there is a reason why the CR people chose to use strings in these cases as opposed to real numbers.

enewe101 commented 10 years ago

I think they use strings for numeric fields whenever they needed to put N/A, due to a db restriction.

My feeling is that the fact that a numeric field has been encoded using strings will be a poor indicator that the values are discrete and enum-like (E.g. I don't think it's the case with the claimed maximum runtime example above). However, it is worth looking into.

Have you tried looking into AttributeStat.getFilterWidget() -- I have not but maybe it tells if a selector or a scale should be used, which would be a clue.

There is also AttributeStat.getPresentationFormat() which might be another source of hints as to the nature of a field.

I did consider putting all values in valueEnum. One con, and the main reason I didn't, is that it becomes impossible to tell at a glance the numeric from the non-numeric values, while the current method gives such a breakout. I chose to summarize numeric and String values seperately for that reason. If we put all numeric values in valueEnum, it will also become huge.

Having said all this, the the best solution is the one that best serves the clients of the API (and is efficient). If you have different data-build-time aggregation needs that should be done once during data-build, we can def put them in!

Also, you can bypass parsing done by the API, if so use attribute.getOriginalValue()

enewe101 commented 10 years ago

Gowri -- Ok, so, currently in master branch:

name has second-citizen status because it does not appear to be useful. Does this make sense?

mangalagb commented 10 years ago

Yes. That clears my doubt. I'l index the products in lucene based on display name. Thanks !

asutcl commented 10 years ago

My current exploration seems to indicate that there is a mix of numerical values, numerical strings and text strings for some attributes. This will have to be handled somehow.

asutcl commented 10 years ago

Here are a few representative examples. I will push a test that prints out all of these. I haven't looked at the filter widget or presentation format yet just remembered. Hopefully this will solve the issue but in the mean time if anyone has any idea how to deal with this please don't hold back.

The format for the examples is:

Category.name:Attribute.name min, max string values

Granola: Serving size min: 1 max: 1 Enum: 2/3, 1/2, 3/4, 1/4,

Digital SLR camera: 35mm equivalent multiplier min: 1.5 max: 5.6 Enum: Data not available, 27.5-83, 24-70, 27-84, 28-112, NA, 24-75, 27-81, 28-84,

Push mower: Engine size min: 140 max: 190 Enum: 24 volts, 13 amps, 36 volts, 40 volts, 12 amps,

14-inch models: Depth min: 8.9 max: 9.7 Enum: 9.6", 9.7", Data not available, 9.9", 9.8", 9.2", 9.5", 9.0", 9.4", 11.1", 9.1", 9.3",

Generator: Claimed output min: 5000 max: 7500 Enum: NG 7000 LPG 8000, NG 7000 LPG 8500, NG 6000 LPG 7000, NG 12000 LPG 14000, NG 6000 LPG 8000, NG 13000 LPG 14000, NG 11500 LPG 13000,

Top-loading washing machine: Water levels min: 0 max: 28 Enum: Infinite, Auto, Auto + 4, Auto plus 3,

enewe101 commented 10 years ago

Good examples! Let's make an issue out of this. IMO it's out of scope of the API and sufficiently complicated to need special attention.

Let me suggest that we get the basic pipeline with attribute selection in place, before devoting resource to this question. Does that make sense? So I suggest tagging said issue to milestone 2.

asutcl commented 10 years ago

Sounds good.

enewe101 commented 10 years ago

The data API is complete and integrated. I'm closing this issue. For bugs or enhancements, please file a new issue.