qfjp / MachineLearning

Repo for machine learning class

Project Roadmap #3

Open qfjp opened 8 years ago

qfjp commented 8 years ago

An area for general discussion about the progress of the project.

qfjp commented 8 years ago

Also, the following part is not clear to me:

"I suggest we work on implementing the metrics I provided for the amazon dataset to see if we can find any metric with a strong correlation to any piece of the associated metadata."

If you recall the 'playing tennis' dataset we started out with, the goal was to find which attributes (wind, sky, etc.) have the lowest entropy with respect to what you are trying to predict (playing tennis). Here, since working with graphs is less clear-cut, we should implement as many scoring methods as possible and rank each one's entropy against each piece of metadata. That way we have a basis for deciding which metrics are 'better'.
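To make that concrete, here's a minimal sketch of the entropy ranking, assuming each graph metric is first discretized into bins (the data below is a toy stand-in, not the Amazon set):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(metric_bins, labels):
    """H(label | metric): entropy left in the labels after splitting on the metric.

    Lower is better: it means the metric tells us more about the label."""
    n = len(labels)
    groups = {}
    for b, y in zip(metric_bins, labels):
        groups.setdefault(b, []).append(y)
    return sum(len(g) / n * entropy(g) for g in groups.values())

# Toy stand-in: one binned graph metric vs. one metadata label per node.
bins   = ['low', 'low', 'high', 'high']
labels = ['Book', 'Book', 'Music', 'Music']
# A perfectly predictive metric leaves zero conditional entropy; ranking every
# (metric, metadata attribute) pair by this score gives the ordering discussed above.
score = conditional_entropy(bins, labels)
```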

Also, what are the targets (the t values) for this dataset? What do you want to predict from it? If you take a look at their graphs, you can see that they are built on the relation between two products as co-purchases.

It looks to me like the Amazon metadata has at least three attributes not included by the graph: salesrank, categorization, and user reviews (which could probably be further divided into subcategories). These would all be the values that we are interested in scoring.

kabdelfatah-zz commented 8 years ago

So now you have to build your own graph that takes these attributes into consideration.

qfjp commented 8 years ago

No, since we would just be using the SNAP dataset. As an example, here is a single item from the metadata dataset:

Id:   1
ASIN: 0827229534
  title: Patterns of Preaching: A Sermon Sampler
  group: Book
  salesrank: 396585
  similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X
  categories: 2
   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]
   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]
  reviews: total: 2  downloaded: 2  avg rating: 5
    2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9
    2003-12-14  cutomer: A2VE83MZF98ITY  rating: 5  votes:   6  helpful:   5

The similar items are given by ASIN, so it would be trivial to feed this into SNAP's API and get an equivalent Amazon product graph, while we could also maintain a hashmap of <ASIN, metadata> pairs to use as our training data.
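A rough sketch of that step, assuming the plain-text record layout shown above (only a few fields are pulled out, and the variable names are mine):

```python
def parse_meta(lines):
    """Parse SNAP amazon-meta style records into a dict of ASIN -> metadata.

    Sketch only: pulls out a handful of fields, not a full parser."""
    items, current = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith('ASIN:'):
            current = {'similar': []}
            items[line.split()[1]] = current
        elif current is not None:
            if line.startswith('title:'):
                current['title'] = line.split(':', 1)[1].strip()
            elif line.startswith('group:'):
                current['group'] = line.split()[1]
            elif line.startswith('salesrank:'):
                current['salesrank'] = int(line.split()[1])
            elif line.startswith('similar:'):
                current['similar'] = line.split()[2:]  # skip the count field
    return items

sample = """Id: 1
ASIN: 0827229534
  title: Patterns of Preaching: A Sermon Sampler
  group: Book
  salesrank: 396585
  similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X
""".splitlines()

meta = parse_meta(sample)
# The 'similar' ASINs give the co-purchase edges for the equivalent product graph.
edges = [(asin, s) for asin, m in meta.items() for s in m['similar']]
```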

kabdelfatah-zz commented 8 years ago

I am still confused. What is the exact target attribute that you want to predict? Is it salesrank, categorization, or user reviews?!

qfjp commented 8 years ago

We're trying to find which is the easiest to predict. There's relatively little cost in trying them all, so let's do that and see what gets the best result.

kabdelfatah-zz commented 8 years ago

I still see that the graph is built on the relation between products. First, there are no weights on the edges of the graph, which I think are essential for extracting graph features. Second, the graph is built from co-purchased products, like a laptop and a mouse. But we need relations between similar products, which could help us predict the user reviews or the average rating. Therefore, we need to build our own graph that supports this. Do you agree with me?

qfjp commented 8 years ago

I don't know where these requirements are coming from. Why do we need a weighted graph?

Also, I don't follow what you mean by 'similar products'. The SNAP Amazon dataset (without metadata) is also based on the 'customers who bought this item also bought' feature, so I think it makes sense to use a metadata graph that is directly equivalent.

kabdelfatah-zz commented 8 years ago

I think weighted graphs can give more intuition about how the nodes are correlated.

For the second point, that is exactly what I mean. The graph connects nodes based on only one feature (co-purchasing), but there are many other factors I think we need to add to make the nodes more connected. That should reflect how they are correlated, like common users or common tags and so on.

Again, our target is to transfer knowledge between two different datasets, so we need to build a graph that shows strong correlation between nodes.

Even if we start with the Amazon dataset, we still need to build our own graph.

qfjp commented 8 years ago

Since we already have an anonymized Amazon dataset, along with an equivalent Amazon dataset with metadata, we should use these to train and test our algorithm. Here are the steps:

  1. Score the Amazon metadata set using the algorithms I list in the writeup.
  2. Run several instances of the algorithms we spoke of in class on a subgraph:
    • Classification is simple, since we can directly create a decision tree over all the classifications in step 1.
    • Linear regression, neural networks, etc. will probably need to be applied to each pair of graph score and metadata attribute (x and t), since we only know how to apply these to single vectors.
  3. Test the algorithm on a subgraph disjoint from the one used in step 2.
  4. Apply our algorithm to the blank graph.

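For the pairwise regressions in step 2, a minimal closed-form least-squares sketch (toy numbers; which metric and target to pair up is left open):

```python
def fit_line(xs, ts):
    """Ordinary least squares for t = a*x + b on one (metric, target) pair."""
    n = len(xs)
    mx, mt = sum(xs) / n, sum(ts) / n
    a = (sum((x - mx) * (t - mt) for x, t in zip(xs, ts))
         / sum((x - mx) ** 2 for x in xs))
    return a, mt - a * mx

# e.g. x = some graph score per node, t = a metadata value (say, salesrank)
xs = [1.0, 2.0, 3.0, 4.0]
ts = [2.0, 4.0, 6.0, 8.0]
a, b = fit_line(xs, ts)  # recovers slope 2, intercept 0 on this toy data
```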
qfjp commented 8 years ago

As for the other concerns

I think weighted graphs can give more intuition about how the nodes are correlated.

This may be the case, but we don't have weighted graphs. If you want to use an algorithm to guess the weights in a graph, that's fine, but I don't think we can guess the weights and then use those guesses to infer information about the nodes.

The graph is connected between nodes based on only one feature for the nodes (which is the co-purchased). But there are many other factors I see we need to add to this graph to make more connected between nodes.

There may be other factors in the metadata, but there are no other factors once we apply these to the anonymized datasets. This is why we can only classify against features of the graph (node degree, bipartition, etc.) rather than against the metadata. The metadata is what we are trying to guess.
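As a sketch of extracting such graph-only features, assuming the usual SNAP edge-list text format (one 'FromNodeId&lt;TAB&gt;ToNodeId' pair per line, '#' comment lines):

```python
from collections import defaultdict

def degrees(edge_lines):
    """In- and out-degree per node from a SNAP-style edge list."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for line in edge_lines:
        if line.startswith('#'):
            continue  # SNAP files carry comment headers
        u, v = line.split()
        out_deg[u] += 1
        in_deg[v] += 1
    return out_deg, in_deg

# Toy edge list in the same shape as the anonymized dataset.
sample = ["# FromNodeId\tToNodeId", "0\t1", "0\t2", "1\t2"]
out_deg, in_deg = degrees(sample)
```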

kabdelfatah-zz commented 8 years ago

As for the other concerns??!!

qfjp commented 8 years ago

Since I have done some work with graphs before, I think it's fair that I work on using some of the algorithms to score the SNAP datasets.

Jie: Since you have some material regarding regression and NN, perhaps you could work on translating this to interface with some code on SNAP graphs. We don't even necessarily need them to work directly together (we could have graph -> scoring -> files -> your code), but we should at least come up with a common "format" so we know how your program expects to read data.

Yiying: Since you would like to work on methods of scoring graphs, feel free to choose shortest path (and potentially others) and try to get some usable output, whether you are printing to a file or just expect your methods to be used by someone else's code. Just let me know if you choose anything other than shortest path so we don't duplicate work.
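For the shortest-path choice, single-source distances on an unweighted graph are just BFS; a minimal stdlib sketch (the adjacency-dict format here is an assumption, not a decided interface):

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop count from source to every reachable node in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:          # first visit is the shortest path
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

adj = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}
print(bfs_distances(adj, 'a'))  # {'a': 0, 'b': 1, 'c': 1, 'd': 2}
```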

Kareem: This probably leaves the most logical split as you working with Jie on the actual machine-learning side of things, but depending on how developed Jie's code is I think it's also reasonable if you want to start implementing more graph algorithms. I'll leave the choice up to you.

For everyone: If you are going to submit code to github, try to only work in your own branch. I.e. each of us works in a branch under our own name, then we will merge them back into master. I think the most sensible workflow will look something like this:

master
 |- Regression
 |   |- Kareem
 |    ` Jie
  `- Graph Metrics
     |- Yiying
      ` Dan
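To make the workflow concrete, here is roughly what the branch-and-merge cycle looks like, demonstrated in a throwaway repo (file names and commit messages are placeholders):

```shell
# Demo of the branch-per-person workflow in a throwaway repo.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name you
echo "base" > notes.txt
git add -A && git commit -qm "initial commit"
base=$(git symbolic-ref --short HEAD)   # 'master' or 'main', depending on git version

git checkout -qb Dan                    # each person works in a branch under their own name
echo "shortest-path scoring" >> notes.txt
git add -A && git commit -qm "graph metric work"

git checkout -q "$base"                 # merge each branch back when it's ready
git merge -q Dan
```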
YiyingW commented 8 years ago
[screenshot: screen shot 2015-10-09 at 9 15 00 am]

Is this the graph we are working on now? Why is it labeled 'directed'? Do we consider the difference between 'one who buys product A also buys B' and 'one who buys product B also buys A'?

YiyingW commented 8 years ago

Dan: Since this is a directed graph, I will work on finding all SCCs first. Just to let you know so we won't duplicate work.

qfjp commented 8 years ago

You're right, it's directed. I was paying attention to the 'Ground truth communities' (com-Amazon) network. SCCs are fine, though the shortest-path algorithm doesn't change much between directed and undirected graphs.
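For the SCC pass, Kosaraju's two-pass algorithm is probably the simplest to get running; a sketch on a toy directed graph (written iteratively, so it won't hit Python's recursion limit on the full dataset):

```python
def sccs(adj):
    """Strongly connected components of a directed graph (Kosaraju, iterative)."""
    # Pass 1: record DFS finish order on the original graph.
    order, seen = [], set()
    for s in adj:
        if s in seen:
            continue
        seen.add(s)
        stack = [(s, iter(adj[s]))]
        while stack:
            u, it = stack[-1]
            for v in it:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(adj[v])))
                    break
            else:
                order.append(u)
                stack.pop()
    # Pass 2: DFS on the reversed graph in reverse finish order.
    rev = {u: [] for u in adj}
    for u in adj:
        for v in adj[u]:
            rev[v].append(u)
    comps, assigned = [], set()
    for s in reversed(order):
        if s in assigned:
            continue
        comp, stack = [], [s]
        assigned.add(s)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in rev[u]:
                if v not in assigned:
                    assigned.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps

# a <-> b form a cycle; c is reachable but not strongly connected to them.
adj = {'a': ['b'], 'b': ['a', 'c'], 'c': []}
print(sorted(sorted(c) for c in sccs(adj)))  # [['a', 'b'], ['c']]
```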

Do we consider the difference between 'one who buys product A also buys B' and 'one who buys product B also buys A'?

It actually looks like SNAP has datasets that do both: the one you point to above uses the relationship you describe, while com-Amazon just regards these as undirected links. I guess this means we can do either; just be sure to document which one you use.