prmr / Creco

Recommendation System for Consumer Products
Apache License 2.0
6 stars 2 forks source link

Redesign the data API #44

Closed prmr closed 10 years ago

prmr commented 10 years ago
prmr commented 10 years ago

enewe101, what's the count field in Category? Is that product count?

prmr commented 10 years ago

Created new branch Issue0044

enewe101 commented 10 years ago

count is the number of products that are in the category. There is also a CR-native field which specified this, but it was wrong. Not sure if any of the API's users rely on count. On Mar 11, 2014 4:58 PM, "prmr" notifications@github.com wrote:

enewe101, what's the count field in Category?

Reply to this email directly or view it on GitHubhttps://github.com/prmr/Creco/issues/44#issuecomment-37349020 .

prmr commented 10 years ago

Guys you have the token. Category2 is the Category class we want to used. To facilitate transition, I left all the old functionality and built a parallel getCategories() accessor for Category2 objects. Right now the CRData constructor works as previous, but at the end simply builds an index of Category2 objects. Please have a look at the new structure and put in additional tests of the equivalence classes.

enewe101 commented 10 years ago

Ok that sounds good. I'll look into the bug you identified in which non-equivalence classes are being returned.

On Tue, Mar 11, 2014 at 8:08 PM, prmr notifications@github.com wrote:

Guys you have the token. Category2 is the Category class we want to used. To facilitate transition, I left all the old functionality and built a parallel getCategories() accessor for Category2 objects. Right now the CRData constructor works as previous, but at the end simply builds an index of Category2 objects. Please have a look at the new structure and put in additional tests of the equivalence classes.

Reply to this email directly or view it on GitHubhttps://github.com/prmr/Creco/issues/44#issuecomment-37361881 .

prmr commented 10 years ago

As part of some work on issue #50 I've added this method to IDataStore:

/* * Returns a product from the database. * @param pId The id of the product. * @return The product with id pId / Product getProduct(String pId);

enewe101 commented 10 years ago

We had talked about removing the distinction between specs and ratings from the Data API, and simply having a boolean inside Attribute and AttributeStat called, e.g. isRatingType, and a corresponding getter, isRating().

I'd like to make getAttribute and getAttributes inside Category, and for milestone 0.3, depricate getRating, getRatings, getSpecification, getSpecifications. These could then be removed for milestone 0.4.

prmr commented 10 years ago

Can you rephrase the 2nd paragraph?

prmr commented 10 years ago

Assuming you're doing the work in branch Issue0044, I strongly recommend merging the master branch into the issue branch. There's a major rework of TypedValue that's now available.

enewe101 commented 10 years ago

Ok, sounds good (Re merging master into Issue0044).

To clarify second Paragraph:

I will be speaking of specs and ratings, which are of course of type Attribute, but in the following, this should be applied equally to the AttributeStat manifestations of specs and ratings.

We had agreed that there is no functional distinction between specs and ratings from the perspective of the CRData, and any of the helpers that build it, such as CategoryBulider. I'm thinking that, to simplify the CRData interface, I would depricate the methods getSpecification(), getSpecifications(), getRating(), and getRatings(), and provide in their stead getAttribute() and getAttributes().

Clients of the CRData could filter if they wanted, by using a method added to Attribute (and AttributeStat) called isRating(). This is, I believe, as we discussed already. My goal is just to confirm and make a trace in the issue.

In fact, it is not necessary to actually depricate getRatings (and its three closely related cousins). I could simply implement them by filtering using isRating() inside CRData, and hence make no change to the interface. The reason that I lean toward providing the unified getAttributes() is because, based on discussions in class, I think that the distinction between ratings and specs throughout the application is actually not productive, and this would help move away from that.

So -- depricate getRatings(), getRating(), getSpecifications(), getSpecification() in CRData, and implement getAttributes(), getAttribute()

-- OR --

Remove distinctions between rating and spec handling under the covers in CRData implementation but make no alteration to the interface.

prmr commented 10 years ago

I assume you mean in Category, but yes, that's the right idea. BTW, there's no clear conceptual link between a category and a rating/spec, because ratings/specs are naturally associated with products. It will be useful, in the Javadocs for Category.getAttribute(String), to document exactly what this returns (i.e., the aggregated attribute values for all the products in a category).

enewe101 commented 10 years ago

Oh right I said in CRData when I meant Category.

10-4 about documenting the return of Category.getAttribute is actually returning an AttributeStat, and why that makes sense.

Alright, I'm ready to get out the scalpel. Here's my game plan. This should be unsurprising, and simply a re-statement of what we discussed. I'm just exposing my plan to leave a trace in the issue, and, in case there is anything that is misunderstood. My plan involves creating one more class that I don't think we originally intended: CRDataBuilder. It's role should be clear from below.

prmr commented 10 years ago

I think it might be better if CRData mediated the relation between the JsonLoader and the builder. The idea here is to make it possible to test the builder separately with synthetic data, i.e., without loading the whole database. That was the idea of the CategoryTree class.

enewe101 commented 10 years ago

Ok, good -- I think we've discovered a point I wasn't clear on :)

So, does this mean that the helper functions I was talking about moving into CRDataBuilder would be moved into CategoryTree? In that case, CRData does keep a reference to JsonLoaderService, and is responsible for handing over the outputs from JsonLoaderService to, CategoryTree. meanwhile, CategoryTree takes on most of the build logic that, in my game-plan above, I had put into CRDataBuilder.

This leads me to seeing that CategoryTree could be tested with synthetic data, which would be written as a test set of CategoryBuilders and Products (written in Java, not json). Does it sound like I have understood?

On Fri, Mar 14, 2014 at 3:02 PM, prmr notifications@github.com wrote:

I think it might be better if CRData mediated the relation between the JsonLoader. The idea here is to make it possible to test the builder separately with synthetic data, i.e., without loading the whole database. That was the idea of the CategoryTree class.

Reply to this email directly or view it on GitHubhttps://github.com/prmr/Creco/issues/44#issuecomment-37683698 .

prmr commented 10 years ago

Yup, if CategoryTree is the implementor of IDataCollector, we can just pass it some dummy CategoryNodes in the tests.

enewe101 commented 10 years ago

Ok that seems totally clear now. Thanks for the feedback!

On Fri, Mar 14, 2014 at 3:18 PM, prmr notifications@github.com wrote:

Yup, if CategoryTree is the implementor of IDataCollector, we can just pass it some dummy CategoryNodes in the tests.

Reply to this email directly or view it on GitHubhttps://github.com/prmr/Creco/issues/44#issuecomment-37685274 .

enewe101 commented 10 years ago

@asutcl, @prmr: I just caught this in discussion of a different issue, said by prmr:

"We just decided to move the AttributeStat functionality into the logic component"

Was this where we left off? Makes a lot of sense to me, but I'm not sure now exactly where responsibility for building AttributeStat falls, now, or when this is done. Should building the data API create AttributeStat or not? Perhaps the code building the data API calls on functionality inside .logic pkg? -- this would seem to be sensible.

Comments?

asutcl commented 10 years ago

Hi Edward,

I was going to comment and this but you got to it before me. The idea is that once this is done we won't need AttributeStat and anyone who will and a list of attributes should go through the AttributeExtractor which will provide the corresponding API. So no need to build AttributeStat in the data.

This is mainly to make things cleaner, to make it easier to debug and to ensure that everything in the data is immutable. I will take this over and it will be done by Thursday. Does that answer your questions?

This will also help to deal with all the cases where the database has "unclean" data.

enewe101 commented 10 years ago

Yup, that makes sense. So I'll just not make any changes to AttributeStat.

By the way, within your AttributeExtractor, do you make any functional distinctions between specs and ratings? (By functional distinction, I mean beyond just storing them in distinct lists, actully doing distinct manipulations?)

In the data API I don't need to make any distinctions and am planning to make a change to Attribute, which is to add a bit that flags whether it is a rating or a spec. Otherwise I have constant duplication of methods for Ratings and Specs that have actually no functional distinction.

On Mon, Mar 17, 2014 at 12:07 PM, asutcl notifications@github.com wrote:

Hi Edward,

I was going to comment and this but you got to it before me. The idea is that once this is done we won't need AttributeStat and anyone who will and a list of attributes should go through the AttributeExtractor which will provide the corresponding API. So no need to build AttributeStat in the data.

This is mainly to make things cleaner, to make it easier to debug and to ensure that everything in the data is immutable. I will take this over and it will be done by Thursday. Does that answer your questions?

Reply to this email directly or view it on GitHubhttps://github.com/prmr/Creco/issues/44#issuecomment-37834169 .

asutcl commented 10 years ago

I think we decided not to have any distinction for now. It would probably a good idea to collapse specs and ratings to attributes. Having a flag would allow us to add distinctions later. It would clean up the API also since right now with an AttributeID you don't know if it is a spec or rating and you have to check in both.

enewe101 commented 10 years ago

Great, so the API will feature Categorys with the methods getAttributes() and getAttribute(String id). The getRatings() etc will exist but be deprecated.

On Mon, Mar 17, 2014 at 4:20 PM, asutcl notifications@github.com wrote:

I think we decided not to have any distinction for now. It would probably a good idea to collapse specs and ratings to attributes. Having a flag would allow us to add distinctions later. It would clean up the API also since right now with an AttributeID you don't know if it is a spec or rating and you have to check in both.

Reply to this email directly or view it on GitHubhttps://github.com/prmr/Creco/issues/44#issuecomment-37865020 .

asutcl commented 10 years ago

One functionality I was wondering about though is for the string value of type "min-max" (eg "10-20"). We had thought of parsing these and replacing them with the mean value.

I was thinking of handling these, the only issue I have is that I don't have a way of passing these down stream as products attribute values will probably be extracted from the products later on by the product ranking algorithm.

The up side of handling it in the logic part is that if we can guess which if lower or higher values then we can replace it by either the min or max correspondingly.

The down side is that there is no elegant way to pass this information down stream.

For this reason I think if at extraction from the database these string could be parsed and converted to their mean value as a numeric value, this would clean up part of the data.

On a similar note, could we parse out the inches signs from strings as to convert these into numeric value. Potentially we could have a list of units we want to parse out of the strings as to convert into numeric values.

enewe101 commented 10 years ago

Er, what I said applies to Product, not Category... anyway, you know what I meant

On Mon, Mar 17, 2014 at 4:41 PM, Edward Newell edward.newell@gmail.comwrote:

Great, so the API will feature Categorys with the methods getAttributes() and getAttribute(String id). The getRatings() etc will exist but be deprecated.

On Mon, Mar 17, 2014 at 4:20 PM, asutcl notifications@github.com wrote:

I think we decided not to have any distinction for now. It would probably a good idea to collapse specs and ratings to attributes. Having a flag would allow us to add distinctions later. It would clean up the API also since right now with an AttributeID you don't know if it is a spec or rating and you have to check in both.

Reply to this email directly or view it on GitHubhttps://github.com/prmr/Creco/issues/44#issuecomment-37865020 .

asutcl commented 10 years ago

Haha, yeah didn't even notice the typo when reading... Although this could be done for Category too no?

enewe101 commented 10 years ago

Yeah, but I thought from the discussion that we are considering moving away from serving AttributeStat from the data API.

Well, I was about to write another post looking at the issues you raised in your last message. It's not clear to me where the concern should be handled. Surely data should be cleaned up at load time (i.e. as part of building the data objects behind the data API). This tends to make me think that the more sophisticated comprehension of aggregate properties of attributes should be done at data load time too.

Would it be beneficial to bring some of your data aggregation / analysis closer to the code that builds the data behind the API, such that it would be run at data load time? Surely some of it should be run as a one-off calculation, like what we have with AttributeStat now...

Well, let's talk in class tomorrow about it!

On Mon, Mar 17, 2014 at 4:49 PM, asutcl notifications@github.com wrote:

Haha, yeah didn't even notice the typo when reading... Although this could be done for Category too no?

Reply to this email directly or view it on GitHubhttps://github.com/prmr/Creco/issues/44#issuecomment-37868202 .

asutcl commented 10 years ago

Since the products and attributes are immutable, my intuition would be that the values extracted from the database would be the ones that are actually used and displayed and that all statistic or aggregate information about these could be computed by the logic.

But yes as you pointed out it would be best to discuss this tomorrow in class.

prmr commented 10 years ago

I see that the changes from branch Issue0050 haven't been pulled into master, so I'll implement Issue #57 into branch Issue0050. You should integrate master into this branch sooner than later.

enewe101 commented 10 years ago

Ok I'll pull master into #44 now.

enewe101 commented 10 years ago

@ceipher I just added Product.getUrl(). It returns a string, but for 583 out of the ~2400 products, no url is available and it returns null

enewe101 commented 10 years ago

I'm having major trouble figuring out how enums work, can somebody help me understand? What is the simplest way to use enums. I've read many examples that show how you can make special constructors and all kinds of weird stuff, and yet, none show just how to use it plainly in the context of a class. I'm looking for something like:

public class Person {
   private String name;
   private enum aState { SLEEPY, HUNGRY, HAPPY};

   public Person(String pName, String pState)
   {
      aName = pName;
      aState = new aState(pState);
   }
}

And then,

public class PersonTest {
   public static void main(String[] args) 
   {
      Person testPerson = new Person("Henry" , "SLEEPY");
   }
}

There are at least two problems here:

prmr commented 10 years ago

You shouldn't need constructors for enums. Have a look at TypedValue in the master branch, or http://docs.oracle.com/javase/tutorial/java/javaOO/enum.html

enewe101 commented 10 years ago

Thanks!! Ok, now I see that person is missing a line:

public class Person {
   private String name;
   private enum State { SLEEPY, HUNGRY, HAPPY};
   *private State aState;
}

Actually http://docs.oracle.com/javase/tutorial/java/javaOO/enum.html was my first read, but didn't help. They never show enum being used as a field -- a spectacular ommission, since, unlike other fields, enum requires _two_ statements to declare a field.

enewe101 commented 10 years ago

After merging master into Issue44, I'm having quite a time trying to resolve errors in TestFiltering and TestAttributeCorrelator!

One issue is that is making this tough is that, even if I run tests in debug mode, Eclipse doesn't stop on exceptions, and doesn't print a stack trace. Is there some special way to ask Eclipse to be more informative in testing? The Internet says that I should just run in debug mode, but that has no effect for me.

asutcl commented 10 years ago

Can I merge the correlation in the ScoredAttribute into #44?

enewe101 commented 10 years ago

Hey Andrew -- sorry for the lag. Yes, go ahead.

On Thu, Mar 20, 2014 at 9:01 PM, asutcl notifications@github.com wrote:

Can I merge the correlation in the ScoredAttribute into #44https://github.com/prmr/Creco/issues/44 ?

Reply to this email directly or view it on GitHubhttps://github.com/prmr/Creco/issues/44#issuecomment-38238811 .

enewe101 commented 10 years ago

Update on Integration: Hi everyone!

Last night's issue with Java versions was just the beginning of my local config woes... But I'm now definitely in a working state. I've created the Integration branch we will use for milestone 3. This branch is available now, and it includes everything that was in master merged with #44 as they were at the time of posting. All tests pass and application is behaving as expected.

When you see this post, kindly call out a quick update about the status of your work, specifically:

Any further work should get merged into Integration (and pushed!). If you suspect or observe that there are impacts to others code, please just post.

Thanks, and sorry for the delay on getting the Integration branch out!

enewe101 commented 10 years ago

Just pushed the collapsing of Specifications and Ratings to Attributes. Changes were propogated to the AttributeExtractor, AttributeCorrelator, RankedFeaturesProducts, and associated tests. The constructor for Attribute is now private -- use the static factory methods buildRating, buildSpecification, and buildPrice, which have the same call signature as the constructor used to have.

Integrated to Integration

nishanthtgwda commented 10 years ago

Probably can be done for Milestone 0.4. The ratings and specifications extracted after parsing the json file based on types of attributes are attached directly to the category. Probably a new attribute inside the category class can be introduced to store all the attributes under it(not ratings and specifications exclusively ) and in the Attribute class, have a field to determine the type (not the type of value string, boolean, but whether its a Spec or a rating). Ultimately, its how we use the API, but since we are considering the solution of merging the ratings and spec,this might be one way to do it.

enewe101 commented 10 years ago

I'm closing this issue. For further additions / modifications please open an issue for milestone 4. That said @nishanth1991 I'm not sure if I understood you correctly. At the moment the type of attribute can be found by doing Attribute.isPrice(), Attribute.isSpecification, or Attribute.isRating() -- maybe that satisfies what you were looking for?

nishanthtgwda commented 10 years ago

Ah okay . Is there any function which will return all the attributes instead of asking exclusively for ratings or specifications ?

And I added in test cases . All of them are passing this time (unlike last time where it failed for custom json string) . Nonetheless is it possible to remerge the branch into integration?. I have pushed the code to the branch.

enewe101 commented 10 years ago

Cool thanks Nishanth, so I guess you meant it's merged with #44? I'll check and merge into Integration if needed.

And so yeah, the function Product.getAttributes() gets all the Attributes (unless you're talking about Categories and AttributeStat -- I use AttributeStat for calculations, and come to think of it, I'd like to make that package-private, since from outside I think people should be relying on AttributeExtractor's services)

enewe101 commented 10 years ago

Merged into Integration -- so now I'm planning to delete this branch on the remote repository, but I'll hold off until we chat later in case that would be bad.