techmatters / soil-id-algorithm

GNU Affero General Public License v3.0
1 stars 0 forks source link

bug: site (rank_soils) with no data may return different soil result than (list_soils) #126

Open knipec opened 2 months ago

knipec commented 2 months ago

Description

My limited understanding is that a site with no data should be equivalent to a temporary location, from the perspective of soil id. From an algorithm perspective this means I'd expect rank_soils with no data entered to give the same Top Match soil as list_soils would give. (Unless there is data that secretly gets added by default when a site is created -- I wouldn't expect that, but I'm not confident it doesn't happen?) However, the app does not behave this way.

Steps To Reproduce

  1. In the app's map search bar, type 47.65712, -122.53839. Wait a bit, and notice the Soil ID Top Match is "Harstine".
  2. Click the button to create a site. Name it something and save.
  3. Click on the new site to see its site dashboard. Do not enter any site data. Expected: The site dashboard also lists its Soil ID Top Match as "Harstine" Actual: The Top Match is now "Dystric xerorthents"

Additional context

      "name": "Harstine3",
      "component": "Harstine",
      "componentID": 23979090,
      "score_data_loc": 1.0,
      "rank_data_loc": "1",
      "score_data": 0.499,
      "rank_data": "1",
      "score_loc": 1.0,
      "rank_loc": "Not Displayed",
      "componentData": "Site data only"

and

      "name": "Dystric xerorthents",
      "component": "Dystric xerorthents",
      "componentID": 23978926,
      "score_data_loc": 0.784,
      "rank_data_loc": "2",
      "score_data": 0.495,
      "rank_data": "2",
      "score_loc": 0.578,
      "rank_loc": "2",
      "componentData": "Site data only"

It also contains "Harstine1" and "Harstine2", which seemed surprising to me.

I'm not quite sure how to make sense of it -- but wondering is "rank_loc": "Not Displayed" unexpected?

Also, looking at the full list of top soil matches on the app, "Harstine" does not show up at all.

Chosen solution

Call rank_soils only once the user has entered data. See conversation below for details.

knipec commented 2 months ago

Also please let us know if you think that output is correct, and we should be interpreting it differently on our end! Thanks!

CourtneyLee333 commented 2 months ago

I just experienced what I think is the same issue. I dropped a temporary pin and opened the temp location dashboard, then tapped the Soil ID tile. I got two matched under Top Soil Matches. I then created a site at that location from the temp location dashboard. Upon opening the Soil ID of my new site I only got one Top Soil Match. Lat: 35.5785 Long: -82.60405

jjmaynard commented 2 months ago

@knipec @CourtneyLee333 @paulschreiber @shrouxm We should schedule a call to discuss this. The responses from the API are correct. Interpretation is a bit confusing and probably easier to go over on a call. Recent iterations to the rank_soil function have incorporated site slope and elevation from digital map sources, which explains the different similarity score for score_data relative to score_loc when no data has been entered.

CourtneyLee333 commented 2 months ago

@DerekCaelin

DerekCaelin commented 2 months ago

I added this to the topic list for our Monday algo call.

DerekCaelin commented 1 month ago

@knipec After talking with Jon, he agrees that the Temporary Location and the Site Location with no entered data should show the same "Top Match", but it IS expected that a site with no entered data would return a different result from a temporary location, because different data is being incorporated.

He suspects that the ranking process is being called for a new site, which shouldn't happen. Expected behavior is that an empty site inherit the data from the temporary location it comes from.

shrouxm commented 1 month ago

@DerekCaelin i agree with @jjmaynard we should just go over this on a call

DerekCaelin commented 1 month ago

@shrouxm I'll add this bug to the next algo discussion (7/29) and invite you

DerekCaelin commented 1 month ago

Options might be:

A)

C) OR Update algorithm so slope is considered part of location score.

@shrouxm to review this solution and refile as a couple issues

shrouxm commented 2 weeks ago

ok i have thought about this a bit more, and i think C is definitely the most correct solution. any other option could potentially create confusing outputs for the user. consider:

for A: we could present the data+location score output of calling rank with no inputs as the location score to the user. but then when they input data, the data score output of calling rank will include some data that was previously part of the location score calculation

for B: we still have the problem that once we call rank when data is entered, there's information flowing into the data score which is not based on what the user has input. e.g. the user could input texture but will then see the results reranked taking into consideration a slope value inferred from a map

both those situations are less glaringly obviously confusing than the original bug that started this thread, so we could choose to implement them for that reason. but ultimately it seems to me that if we're pulling data from maps, it should all get included in the location score, and updating the algorithm to work that way is the only way we can present totally consistent information to the user.

so i guess follow-up questions i have for others:

CourtneyLee333 commented 2 weeks ago

@shrouxm Does the data being pulled from the map include slope (which may or may not be accurate)?

DerekCaelin commented 2 weeks ago

@DerekCaelin do you think there's a strong impetus to implement A or B in the meantime if we do plan to implement C in the future?

@shrouxm I think B would be a decent solution in the short term. It would address the problem listed above (unexpected change in ranked soils list after creating a site). If we think that including slope in location score would improve list_soils, C is an additional value add.

How much effort do you suppose it would be to implement B?

shrouxm commented 2 weeks ago

@shrouxm Does the data being pulled from the map include slope (which may or may not be accurate)?

@CourtneyLee333 the algorithm does pull possibly inaccurate slope data from maps, but it includes that slope measurement in the data score rather than the location score which is what is causing this bug (since it does not get incorporated into the matches shown in a map popover, but then does get incorporated into the data score once you create a site)

How much effort do you suppose it would be to implement B?

@DerekCaelin I think this kind of depends on the answer to the question posed in @CourtneyLee333's earlier slack thread about how to present this to the user, since any way I can think of implementing "don't call rank until data is entered" would require user facing changes. If it's just not showing the soil properties score before data is entered, I would say low effort, 1-3.

shrouxm commented 2 weeks ago

wait, scratch that last thought, i got lost in courtney's thread and forgot this also affects the match list. assuming we similarly do a very simple thing and just use the location score until the user enters data, definitely 3 and not 1!

and yeah i guess to re-iterate, without solving C we are trading this issue for a different issue by pursuing B, since while the scores will remain consistent when the user first creates a site, the user might then find that when they input a texture value which matches a soil type, that soil type's score then goes down (because we're incorporating some assumed slope data or some such)

DerekCaelin commented 1 week ago

We agreed in our algo call that: