BerntA opened 4 years ago
I have not; actually, it keeps giving me a different number of records and a different classification for 'gospel'.
I think the mapping probability is using different probabilities for p(f), since according to the rule, with a uniform p(f), the sum of all probabilities should be 1. I have not been able to make it work or pass with the given equation.
Yep, I get 2748 now; previously I had 2743 indexed artists. (I still pass the tests I passed previously, though.)
Me too, 2748, and I think it is a bug. I have been trying for days and reading papers... they all say the same thing, but I still get a different result.
Yeah, same here. The formula is very clear, but applying it still does not give the result the test expects. The prof or TA should run through the entire assignment, re-index, etc., and confirm.
2748 as well, but it doesn't pass the assertions. Stuck.
I'm getting 2748 entities and I'm passing the tests, except for get_term_mapping_probs, which seems to be bugged.
We could ask for a deadline extension once the cells are updated (or we are allowed to modify the asserts), since we lost a lot of time trying to figure out these issues.
Out of curiosity about the mapping probs: has anyone used both the Elasticsearch instance and the clm, or just one of them? I've tried several combinations, but nothing seems to work, which leads me to believe that the test is somehow bugged.
@Chrystallic You only need the clm object if you implement it according to the supplied formula.
Ok, my current implementation utilizes only the 'clm' object, but I was curious if someone had gotten plausible answers using both the 'clm' and 'es' objects.
Indeed, there is a bug in the field mapping probabilities tests. This part of the exercise will be corrected manually. Also, getting fewer than 2500 items indexed will not be penalized.
Do any of you who have not passed the 'Tests for field mapping probabilities' still pass the following 'Tests for PRMS retrieval' (the second-to-last cell, before the end of the assignment notebook)? I assumed there was an issue due to get_term_mapping_probs() being miscalculated.
@kbalog If you were aware of this "bug", why did you not inform the students of this issue earlier? Would it be possible to at least provide us with a correct value for one of the fields? For instance, what is the expected output value for the description field in Pt_f_3_1?
clm_3 = CollectionLM(es, analyze_query(es, 'gospel soul'))
Pt_f_3_1 = get_term_mapping_probs(es, clm_3, 'gospel')
assert Pt_f_3_1['description'] == pytest.approx(?????)
Hi,
There are two issues raised here, the number of entities to be indexed and the field mapping probabilities.
Regarding the number of entities to be indexed, it appears the updating of the subpages of 'Lists_of_musicians' has been happening more frequently than anticipated when creating the assignment, and thus the tests will need to be updated. After running through the code in the reference solution this afternoon (13:30), there were 2748 unique entities indexed.
The affected test cells are updated and provided below. You can insert and run these as code cells in your notebook, but do remove them before submission. Grading will use tests based on the updates that are current at that time.
This is the current, updated test cell following bulk_index:
# Test index
tv_1 = es.termvectors(index=INDEX_NAME, id='Al Green', fields='catch_all')
assert tv_1['term_vectors']['catch_all']['terms']['1946']['term_freq'] == 23
tv_2 = es.termvectors(index=INDEX_NAME, id='Runhild Gammelsæter', fields='attributes')
assert tv_2['term_vectors']['attributes']['terms']['khlyst']['term_freq'] == 1
tv_3 = es.termvectors(index=INDEX_NAME, id='Music of Belarus', fields='description')
assert tv_3['term_vectors']['description']['terms']['kazakhstan']['term_freq'] == 1
tv_4 = es.termvectors(index=INDEX_NAME, id='MC HotDog', fields=['types', 'description'])
assert tv_4['term_vectors']['description']['terms']['microphon']['term_freq'] == 1
tv_5 = es.termvectors(index=INDEX_NAME, id='Deadmau5', fields=['types', 'description', 'related_entities'])
assert tv_5['term_vectors']['description']['terms']['italiano']['term_freq'] == 1
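Not part of the assignment tests, but as a quick sanity check on the entity count mentioned above, the standard Elasticsearch count API can be used; per the reference run this should report 2748:
# Optional sanity check: total number of indexed entities
es.indices.refresh(index=INDEX_NAME)
print(es.count(index=INDEX_NAME)['count'])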
The updated test cell for field mapping probabilities is:
# Tests for field mapping probabilities
clm_3 = CollectionLM(es, analyze_query(es, 'gospel soul'))
Pf_t_3_1 = get_term_mapping_probs(es, clm_3, 'gospel')
assert Pf_t_3_1['description'] == pytest.approx(0.19345, abs=1e-5)
assert Pf_t_3_1['attributes'] == pytest.approx(0.13442, abs=1e-5)
assert Pf_t_3_1['related_entities'] == pytest.approx(0.30600, abs=1e-5)
Pf_t_3_2 = get_term_mapping_probs(es, clm_3, 'soul')
assert Pf_t_3_2['names'] == pytest.approx(0.01502, abs=1e-5)
assert Pf_t_3_2['types'] == pytest.approx(0.10632, abs=1e-5)
assert Pf_t_3_2['catch_all'] == pytest.approx(0.16552, abs=1e-5)
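Independent of the exact expected values, a quick self-check is that the mapping probabilities are normalized: assuming the returned dictionary has an entry for every field, each one should sum to 1:
# Self-check: mapping probabilities over all fields should sum to 1
assert sum(Pf_t_3_1.values()) == pytest.approx(1.0, abs=1e-6)
assert sum(Pf_t_3_2.values()) == pytest.approx(1.0, abs=1e-6)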
The updated test cell for prms_retrieval is:
# Tests for PRMS retrieval
prms_query_1 = 'winter snow'
prms_retrieval_1 = prms_retrieval(es, prms_query_1)
assert prms_retrieval_1[:5] == ['Snow (musician)', 'Kurt Winter', 'Hank Snow', 'Derek Miller', 'Richard Manuel']
prms_query_2 = 'summer sun'
prms_retrieval_2 = prms_retrieval(es, prms_query_2)
assert prms_retrieval_2[:5] == ['Joseph Summers (musician)', 'Sun Nan', 'Stefanie Sun', 'Virgil Sturgill', 'Robert Turner (composer)']
prms_query_3 = 'freedom jazz'
prms_retrieval_3 = prms_retrieval(es, prms_query_3)
assert prms_retrieval_3[:5] == ['Paul Bley', 'Stéphane Galland', 'Saifo', 'Christy Sutherland', 'K.d. lang']
and the final test cell comparing the above with baseline retrieval needs no update.
None of the other test cells in A4 need updating at this time.
There was a small notational inconsistency in the description and variable names of the code for get_term_mapping_probs.
The correct description should be: "Implement the function get_term_mapping_probs to return $P(f|t)$ of all fields $f$ for a given term $t$ according to the description above of how PRMS extends the MLM retrieval model." Also, the dictionary of mapping probabilities returned by get_term_mapping_probs should properly be named Pf_t rather than Pt_f, reflecting that the mapping probability is the probability (normalized over the probabilities of all fields) that a given field $f$ is the field to which a term $t$ would be mapped.
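For reference, the standard PRMS formulation of this mapping probability is $P(f|t) = \frac{P(t|C_f)\,P(f)}{\sum_{f'} P(t|C_{f'})\,P(f')}$, where $P(t|C_f)$ is the term's probability under the collection language model of field $f$ and $P(f)$ is a field prior, taken here as uniform. The longer description under "Retrieval method" in the notebook remains the authoritative version for the assignment.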
You may consider this your starting point for implementing that solution, if you find it helpful:
def get_term_mapping_probs(es, clm, term):
    """PRMS: For a single term, find its mapping probabilities for all fields.

    Arguments:
        es: An active Elasticsearch instance.
        clm: Collection language model instance.
        term: A single-term string.

    Returns:
        Dictionary of mapping probabilities for the fields.
    """
    Pf_t = {}
    # YOUR CODE HERE
    raise NotImplementedError()
    return Pf_t
Use the longer mathematical description under the heading "Retrieval method" in the A4 notebook and note that the mapping probability is $P(f|t)$.
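For what it is worth, here is a minimal sketch of that starting point under the standard PRMS formula. It assumes the CollectionLM instance exposes the list of fields as clm.fields and the smoothed field collection probability $P(t|C_f)$ as clm.prob(field, term) (both attribute names are assumptions, not confirmed in this thread), and it uses a uniform field prior:
def get_term_mapping_probs(es, clm, term):
    """Sketch: mapping probabilities P(f|t) for a single term over all fields.

    Assumes clm.fields lists the indexed fields and clm.prob(field, term)
    returns P(t|C_f); both are assumptions about the course's CollectionLM.
    """
    # es is unused here; per the discussion above, only the clm object is needed.
    Pf_t = {}
    p_f = 1 / len(clm.fields)  # uniform field prior P(f)
    # Unnormalized scores P(t|C_f) * P(f); with a uniform prior the p_f factor
    # cancels in the normalization, but it is kept to mirror the formula.
    scores = {field: clm.prob(field, term) * p_f for field in clm.fields}
    total = sum(scores.values())
    # Normalize over all fields so the mapping probabilities sum to 1.
    for field, score in scores.items():
        Pf_t[field] = score / total if total > 0 else 0.0
    return Pf_t
With this normalization and a uniform prior, the values in the returned dictionary sum to 1, which matches the point raised earlier in the thread about uniform p(f).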
Are we going to get an extension? Because it's just 1 hour until the deadline.
The new tests still seem to be wrong; we all get the wrong answer (and we all get the same wrong answer). Either you have implemented the field mapping probabilities wrong, or we are being heavily misled by the description in the assignment, leading us to implement it wrong (pardon me if I'm being harsh here). Or are we just indexing different content even though we have the same filtering procedure?
The second test seems more promising, but still fails, while the first test is way off for the description field.
I get the same mapping result as @BerntA
The deadline has passed... bonus points then?
Please refer to this page for clarification and guidance. And yes, the deadline has been extended by 48 hours.
Has anyone passed this one? This is the corrected test; I'm wondering whether this is still a bug or not.
prms_query_2 = 'summer sun'
prms_retrieval_2 = prms_retrieval(es, prms_query_2)
assert prms_retrieval_2[:5] == ['Joseph Summers (musician)', 'Sun Nan', 'Stefanie Sun', 'Virgil Sturgill', 'Robert Turner (composer)']
Yeah, the results here still depend on the indexing. I get:
'Joseph Summers (musician)',
'Sun Nan',
'Stefanie Sun',
'Sylvia (singer)',
'Virgil Sturgill'
Okay, same here :) Thanks!
For anyone seeing this thread later, see the A4 errata.
In this case, should we paste and keep the new test cell, or remove it after our test?
Replace the original test cell with the new one.
DBpedia is regularly updated and may change certain pages that affect this assignment. Some people now get more than 2743 indexed items; will this mean that we will all fail the tests for this assignment because of that?
Also, the field mapping probabilities test seems to be bugged; has anyone actually passed it so far?