kbalog / uis-dat640-fall2020

Information Retrieval and Text Mining course at the University of Stavanger (DAT640), 2020 fall

A4 Issues #8

Open BerntA opened 4 years ago

BerntA commented 4 years ago

DBpedia is regularly updated and may change certain pages that affect this assignment. Some people now get more than 2743 indexed items; does this mean we will all fail the tests for this assignment?

Also, the field mapping probabilities test seems to be bugged. Has anyone actually passed it so far?

asahicantu commented 4 years ago

I have not; it actually keeps giving me a different number of records and a different classification for 'gospel'.

I think the mapping probability test is using different values for $P(f)$, since according to the rule, with a uniform $P(f)$ the sum of all mapping probabilities should be 1. I have not been able to make it work or pass with the given equation.
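For reference, this is the normalization in question. In the standard PRMS formulation (the textbook form; the notebook's exact notation may differ), the mapping probability of term $t$ to field $f$ is

$$P(f|t) = \frac{P(t|f)\,P(f)}{\sum_{f'} P(t|f')\,P(f')}$$

so the denominator guarantees $\sum_f P(f|t) = 1$ for every term, whatever prior $P(f)$ is used.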

BerntA commented 4 years ago

Yep, I get 2748 now; previously I had 2743 indexed artists. (I still pass the tests I passed previously, though.)

FebriantiW commented 4 years ago

Me too, 2748, and I think it is a bug.. I've been trying for days and reading papers... they all say the same thing, but I still get a different result...

BerntA commented 4 years ago

Yeah, same here. The formula is very clear, but applying it still does not give the result the test expects. The prof. or TA should run through the entire assignment, re-index, etc., and confirm.

Wizdore commented 4 years ago

2748 as well.. doesn't pass the assertions. Stuck.

ChristofferHolmesland commented 4 years ago

I'm getting 2748 entities and I'm passing the tests, except get_term_mapping_probs, which seems to be bugged.

asahicantu commented 4 years ago

We would suggest a deadline extension, since we lost a lot of time trying to figure out these issues; either the cells should be updated, or we should be allowed to modify the asserts...

Chrystallic commented 4 years ago

Out of curiosity about the mapping probs: have people used both the Elasticsearch instance and the clm, or just one of them?

I've tried several combinations, but nothing seems to work... which leads me to believe that the test is somehow bugged.

ChristofferHolmesland commented 4 years ago

Out of curiosity about the mapping probs: have people used both the Elasticsearch instance and the clm, or just one of them? @Chrystallic

You only need the clm object if you implement it according to the supplied formula.

Chrystallic commented 4 years ago

You only need the clm object if you implement it according to the supplied formula.

Ok, my current implementation utilizes only the 'clm' object, but I was curious if someone had gotten plausible answers using both the 'clm' and 'es' objects.

kbalog commented 4 years ago

Indeed, there is a bug in the field mapping probabilities tests. This part of the exercise will be corrected manually. Also, getting less than 2500 items indexed will not be penalized.

a-nesse commented 4 years ago

Do any of you who have not passed the 'Tests for field mapping probabilities' still pass the following 'Tests for PRMS retrieval' (the second-to-last cell, before the end of the assignment notebook)?

I assumed there was an issue due to get_term_mapping_probs() being miscalculated.

Chrystallic commented 4 years ago

Indeed, there is a bug in the field mapping probabilities tests. This part of the exercise will be corrected manually. Also, getting less than 2500 items indexed will not be penalized.

@kbalog. If you were aware of this "bug", why have you not informed the students earlier? Would it be possible to at least provide us with a correct value for one of the fields? For instance, what is the expected output value for the description field in Pt_f_3_1?

clm_3 = CollectionLM(es, analyze_query(es, 'gospel soul'))
Pt_f_3_1 = get_term_mapping_probs(es, clm_3, 'gospel')
assert Pt_f_3_1['description'] == pytest.approx(?????)

tlinjordet commented 4 years ago

Hi,

There are two issues raised here, the number of entities to be indexed and the field mapping probabilities.

(1) Changing numbers of entities to be indexed

Regarding the number of entities to be indexed, it appears the updating of the subpages of 'Lists_of_musicians' has been happening more frequently than anticipated when creating the assignment, and thus the tests will need to be updated.

After running through the code in the reference solution this afternoon (13:30), there were 2748 unique entities indexed.

The affected test cells are updated and provided below. You can insert and run these as code cells in your notebook, but remove them before submission. Grading will use the test updates that are current at that time.

This is the current, updated test cell following bulk_index:

# Test index
tv_1 = es.termvectors(index=INDEX_NAME, id='Al Green', fields='catch_all')
assert tv_1['term_vectors']['catch_all']['terms']['1946']['term_freq'] == 23

tv_2 = es.termvectors(index=INDEX_NAME, id='Runhild Gammelsæter', fields='attributes')
assert tv_2['term_vectors']['attributes']['terms']['khlyst']['term_freq'] == 1

tv_3 = es.termvectors(index=INDEX_NAME, id='Music of Belarus', fields='description')
assert tv_3['term_vectors']['description']['terms']['kazakhstan']['term_freq'] == 1

tv_4 = es.termvectors(index=INDEX_NAME, id='MC HotDog', fields=['types', 'description'])
assert tv_4['term_vectors']['description']['terms']['microphon']['term_freq'] == 1

tv_5 = es.termvectors(index=INDEX_NAME, id='Deadmau5', fields=['types', 'description', 'related_entities'])
assert tv_5['term_vectors']['description']['terms']['italiano']['term_freq'] == 1

The updated test cell for field mapping probabilities is:

# Tests for field mapping probabilities
clm_3 = CollectionLM(es, analyze_query(es, 'gospel soul'))
Pf_t_3_1 = get_term_mapping_probs(es, clm_3, 'gospel')
assert Pf_t_3_1['description'] == pytest.approx(0.19345, abs=1e-5)
assert Pf_t_3_1['attributes'] == pytest.approx(0.13442, abs=1e-5)
assert Pf_t_3_1['related_entities'] == pytest.approx(0.30600, abs=1e-5)

Pf_t_3_2 = get_term_mapping_probs(es, clm_3, 'soul')
assert Pf_t_3_2['names'] == pytest.approx(0.01502, abs=1e-5)
assert Pf_t_3_2['types'] == pytest.approx(0.10632, abs=1e-5)
assert Pf_t_3_2['catch_all'] == pytest.approx(0.16552, abs=1e-5)

The updated test cell for prms_retrieval is:

# Tests for PRMS retrieval

prms_query_1 = 'winter snow'
prms_retrieval_1 = prms_retrieval(es, prms_query_1)
assert prms_retrieval_1[:5] == ['Snow (musician)', 'Kurt Winter', 'Hank Snow', 'Derek Miller', 'Richard Manuel']

prms_query_2 = 'summer sun'
prms_retrieval_2 = prms_retrieval(es, prms_query_2)
assert prms_retrieval_2[:5] == ['Joseph Summers (musician)', 'Sun Nan', 'Stefanie Sun', 'Virgil Sturgill', 'Robert Turner (composer)']

prms_query_3 = 'freedom jazz'
prms_retrieval_3 = prms_retrieval(es, prms_query_3)
assert prms_retrieval_3[:5] == ['Paul Bley', 'Stéphane Galland', 'Saifo', 'Christy Sutherland', 'K.d. lang']

The final test cell, comparing the above with baseline retrieval, needs no update.

None of the other test cells in A4 need updating at this time.

(2) Field mapping probabilities

There was a small notational inconsistency in the description and in the variable names of the code for get_term_mapping_probs.

The correct description should be: "Implement the function get_term_mapping_probs to return $P(f|t)$ for all fields $f$ for a given term $t$, according to the description above of how PRMS extends the MLM retrieval model." Additionally, the dictionary of mapping probabilities returned by get_term_mapping_probs should properly be named Pf_t rather than Pt_f, reflecting that the mapping probability is the probability (normalized over all fields) that a given field $f$ is the field to which term $t$ is mapped.

You may consider this your starting point for implementing that solution, if you find it helpful:

def get_term_mapping_probs(es, clm, term):
    """PRMS: For a single term, find their mapping probabilities for all fields.

    Arguments:
        es: An active Elasticsearch instance.
        clm: Collection language model instance.
        term: A single-term string. 

    Returns:
        Dictionary of mapping probabilities for the fields.
    """
    Pf_t = {}
    # YOUR CODE HERE
    raise NotImplementedError()
    return Pf_t

Use the longer mathematical description under the heading "Retrieval method" in the A4 notebook and note that the mapping probability is $P(f|t)$.
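If it helps with debugging, the computation has roughly the following shape. This is a minimal sketch only, assuming the CollectionLM object exposes the field list and the smoothed collection probability $P(t|f)$ through accessors named fields() and prob(field, term); both names are hypothetical, so substitute the actual interface from the A4 notebook.

def get_term_mapping_probs_sketch(es, clm, term):
    """Minimal PRMS sketch: mapping probabilities P(f|t) for a single term.

    NOTE: clm.fields() and clm.prob(field, term) are hypothetical accessor
    names; substitute the actual CollectionLM interface from the notebook.
    """
    fields = clm.fields()  # hypothetical: list of document fields
    p_f = 1 / len(fields)  # uniform field prior P(f)
    # Numerator per field: P(t|f) * P(f), with P(t|f) the smoothed
    # collection language model probability of the term in that field.
    numerators = {f: clm.prob(f, term) * p_f for f in fields}  # hypothetical
    # Denominator: sum over all fields f' of P(t|f') * P(f').
    total = sum(numerators.values())
    # Normalized mapping probabilities P(f|t); these sum to 1 over fields.
    return {f: (v / total if total > 0 else 0.0) for f, v in numerators.items()}

Note that with a uniform prior, $P(f)$ cancels out of the ratio; it is kept explicit here only to mirror the formula.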

Wizdore commented 4 years ago

Are we going to get an extension? Because it's just 1 hour till the deadline.

BerntA commented 4 years ago

Hi,

There are two issues raised here, the number of entities to be indexed and the field mapping probabilities. [...]

The new tests still seem to be wrong; we all get the same wrong answer. Either you have implemented the field mapping probabilities wrong, or we are being heavily misled by the description in the assignment and implementing it wrong.. (pardon me if I'm being harsh here)

Or are we just indexing different content, even though we have the same filtering procedure?

BerntA commented 4 years ago

The second test seems more promising but still fails, while the first test is way off on the description field.

FebriantiW commented 4 years ago

I get the same mapping result as @BerntA

FebriantiW commented 4 years ago

The deadline has passed.. bonus points, then?

kbalog commented 4 years ago

Please refer to this page for clarification and guidance. And yes, the deadline has been extended by 48 hours.

FebriantiW commented 4 years ago

Has anyone passed this one? This is the corrected test; wondering whether this is still a bug or not.

prms_query_2 = 'summer sun'
prms_retrieval_2 = prms_retrieval(es, prms_query_2)
assert prms_retrieval_2[:5] == ['Joseph Summers (musician)', 'Sun Nan', 'Stefanie Sun', 'Virgil Sturgill', 'Robert Turner (composer)']

BerntA commented 4 years ago

Has anyone passed this one? This is the corrected test; wondering whether this is still a bug or not. [...]

Yeah, the results here still depend on the indexing. I get:

'Joseph Summers (musician)',
 'Sun Nan',
 'Stefanie Sun',
 'Sylvia (singer)',
 'Virgil Sturgill'

FebriantiW commented 4 years ago

okay, same same :). Thanks!

tlinjordet commented 4 years ago

For anyone seeing this thread later, see the A4 errata.

kbalog commented 4 years ago

In this case, should we paste and keep the new test cell, or remove it after our test?

Replace the original test cell with the new one.