snowkylin / data-mining-pku

Instruction of assignment in course "Data Warehousing and Data Mining Technology", Spring and Summer Semester, 2017.

Some ambiguous points about our homework #1

Open LayneIns opened 7 years ago

LayneIns commented 7 years ago
  1. Task.1 and Task.2 seem to share a common title in the slides. Is that a mistake? If so, what is the original description of Task.2? Is it to analyze the correlation between different attributes of the given data and show it in a table or graph?

  2. Task.3 asks us to complete a phrase mining task based on frequent patterns. Here, "phrase" seems ambiguous. For example, if a word A is always followed by an adjacent word B, then we call "A B" a phrase. However, if A always appears with B but they are not adjacent, as in "A ***** B", should "A B" also count as a phrase? I hope my question is clear.

Thank you.

PkuDavidGuan commented 7 years ago

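Hello TA, like the poster above, I am also unsure about what exactly Task 3 requires. It is not clear to me over what scope the frequent itemsets are defined: the entire dataset, a single book in the dataset, or a single chapter of a book. The given dataset contains about 3,000 books of different genres, so their frequent patterns may differ. If we mine over the entire dataset, we may end up with only meaningless itemsets such as "I am" and "He is".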

snowkylin commented 7 years ago

@WillsNew

  1. Yes, they share a common title, but they focus on different kinds of graphs. For task 1, you need to show the distribution of the data. For task 2, you need to show the correlation between different attributes.
  2. The "phrase" here is more general. Two words can be called a "phrase" when they always appear together in one sentence, even if they are not adjacent. Sorry for the ambiguity; I will update the page soon.
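
To make the broader definition of "phrase" concrete, here is a minimal sketch that counts unordered word pairs co-occurring in the same sentence, adjacent or not, and keeps those that reach a minimum support. The function name `frequent_pairs`, the tokenizing regex, and the toy sentences are illustrative assumptions, not part of the assignment, and a real solution would likely use a full frequent pattern algorithm such as Apriori or FP-growth.

```python
import re
from collections import Counter
from itertools import combinations


def frequent_pairs(sentences, min_support):
    """Return word pairs that co-occur in at least `min_support` sentences.

    Each sentence is treated as one transaction (a set of words), so a pair
    is counted once per sentence regardless of word order or adjacency.
    """
    counts = Counter()
    for sentence in sentences:
        # Lowercase and deduplicate; sort so each pair has one canonical order.
        words = sorted(set(re.findall(r"[a-z']+", sentence.lower())))
        counts.update(combinations(words, 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Toy data (made up for illustration).
sentences = [
    "New data arrives in York",
    "The new museum in York opened",
    "York is old",
]
pairs = frequent_pairs(sentences, min_support=2)
print(pairs)
```

Here "new" and "york" form a frequent pair even though they are not adjacent in either sentence, which matches the definition given above.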

@PkuDavidGuan I think you can apply some "stop word" methods, which are common in NLP-related tasks. You can also select a subset of the books that is related to a specific topic, for example all books written by Charles Dickens, all fairy tale books, or all books written in the 18th century.
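
A rough sketch of both suggestions might look like the following. The stop word list, the `books` records, and their `author` field are hypothetical placeholders, since the actual metadata format of the dataset is not described here.

```python
# A tiny hand-written stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"i", "am", "he", "is", "the", "a", "an", "of", "and", "to", "in"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop word set."""
    return [t for t in tokens if t.lower() not in stop_words]


def select_books(books, author):
    """Keep only the books written by the given author."""
    return [b for b in books if b["author"] == author]


# Hypothetical metadata records for illustration only.
books = [
    {"title": "Oliver Twist", "author": "Charles Dickens"},
    {"title": "Emma", "author": "Jane Austen"},
]
dickens = select_books(books, "Charles Dickens")
filtered = remove_stop_words("He is a poor orphan".split())
print(dickens, filtered)
```

Filtering stop words before mining removes itemsets such as "I am" and "He is", and restricting the corpus to one topic or author keeps the remaining patterns comparable.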

Hope that helps.