Determining which code snippet to show, if any

gleitz commented 3 years ago

Consider the following:

I think that it is true because the variable was set

vs.

class Error():
   pass

We would want to return the larger code block

How many newlines?
How many codeblocks?
Perhaps never return <code> and only go with pre?
Length of the answer

ykskks commented 3 years ago

Hello, can you elaborate on this? I would like to work on this.

It seems to me pretty difficult to determine which code blocks to show with a rule-based approach. Larger doesn't mean better. Sometimes people give a "lengthy and not efficient answer" and "the shorter and cleaner answer" at the same time. Returning the larger code block might result in people getting "bad" code.

If you have new ideas about this topic, I would gladly hear about it.

gleitz commented 3 years ago

Thanks for the help @ykskks. I'll explain the current logic and what we might do to improve it.

Today, we take the first answer and try to pull out any <pre> or <code> blocks. We prefer that order because pre is usually for multi-line content (the class Error above) and code is for inline content (true in the example above). If we don't find any of those blocks we just return the entire answer.
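The selection order described above can be sketched roughly as follows. This is a minimal illustration using only the standard library's html.parser, not howdoi's actual implementation; the names BlockCollector and select_snippet are hypothetical, and returning only the first matching block is an assumption.

```python
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Collect text inside <pre> elements and inside <code> elements outside a <pre>."""

    def __init__(self):
        super().__init__()
        self.pre_blocks = []   # text of each <pre> element
        self.code_blocks = []  # text of each inline <code> element
        self.all_text = []     # every piece of text, for the fallback
        self._stack = []       # currently open pre/code tags
        self._buffer = []      # text of the outermost open pre/code element

    def handle_starttag(self, tag, attrs):
        if tag in ("pre", "code"):
            if not self._stack:
                self._buffer = []
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in ("pre", "code") and self._stack and self._stack[-1] == tag:
            self._stack.pop()
            if not self._stack:
                # The outermost element decides where the text goes, so
                # <pre><code>...</code></pre> counts as a pre block.
                text = "".join(self._buffer).strip()
                target = self.pre_blocks if tag == "pre" else self.code_blocks
                target.append(text)

    def handle_data(self, data):
        self.all_text.append(data)
        if self._stack:
            self._buffer.append(data)


def select_snippet(answer_html):
    """Prefer the first <pre>, then the first inline <code>, else the whole answer."""
    parser = BlockCollector()
    parser.feed(answer_html)
    parser.close()
    if parser.pre_blocks:
        return parser.pre_blocks[0]
    if parser.code_blocks:
        return parser.code_blocks[0]
    return "".join(parser.all_text).strip()
```

With the first example from the top of the thread, this sketch returns just true, which is exactly the kind of output that makes no sense on its own.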

I was thinking we could try and develop some heuristic of when to show the entire answer vs one (or more) of these pre or code blocks. The ideas above (how many code blocks the answer contains, the ratio of the length of the block to the overall number of characters in the answer, just dropping inline code blocks, etc) were a starting point of how we might build that logic.
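As a rough sketch of what such a heuristic could look like, the signals listed above can be computed as simple features and combined into one decision. Everything here is an assumption for illustration: the names AnswerFeatures, compute_features and should_show_whole_answer are hypothetical, and the 0.2 threshold is an arbitrary placeholder, not a tuned value.

```python
from dataclasses import dataclass


@dataclass
class AnswerFeatures:
    num_pre: int              # how many <pre> blocks the answer contains
    num_code: int             # how many inline <code> blocks it contains
    newlines_in_largest: int  # newlines inside the largest <pre> block
    code_ratio: float         # largest <pre> length / total answer length


def compute_features(answer_text, pre_blocks, inline_codes):
    """Turn an answer and its extracted blocks into simple numeric signals."""
    largest = max(pre_blocks, key=len, default="")
    total = len(answer_text)
    return AnswerFeatures(
        num_pre=len(pre_blocks),
        num_code=len(inline_codes),
        newlines_in_largest=largest.count("\n"),
        code_ratio=(len(largest) / total) if total else 0.0,
    )


def should_show_whole_answer(features, min_ratio=0.2):
    """Decide between the whole answer and an extracted block.

    min_ratio is a hypothetical threshold chosen only for this example.
    """
    # Inline code is dropped: without a <pre> block there is nothing
    # worth extracting, so show the whole answer.
    if features.num_pre == 0:
        return True
    # A block that is a tiny fraction of the answer is probably not the answer.
    return features.code_ratio < min_ratio
```

Keeping the features separate from the decision makes it possible to log them for the problem queries and adjust the threshold later.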

For what it's worth, I don't think showing the whole answer is that bad. It's probably better than extracting some code block (true) and returning only that, because on its own it makes no sense.

A list was started with some problem queries and some thoughts on how we might build the answer selection logic. This list is a little old, so some of the answers are not the same as what they used to be, but feel free to edit that or use it for inspiration.

ykskks commented 3 years ago

Thank you for the detailed answer!

I will collect more data from the current version of howdoi and see how I can improve it.

Should I first share ideas with you in Issues or submit a PR directly?

gleitz commented 3 years ago

Thanks! We can use this thread to post ideas - others can see it and help out too

gleitz commented 3 years ago

But when you're ready to submit a PR that's great too because I haven't really thought through how the "rules" should be applied in a clear and extensible way. So even if we don't have the perfect set of rules we can at least have the scaffolding to apply them as they grow and change.
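One possible shape for that scaffolding, offered only as a sketch: each rule is a small function that either returns the text to show or passes, and rules are tried in registration order. The names RULES, rule, single_pre and choose_output are hypothetical and not part of howdoi.

```python
from typing import Callable, List, Optional

# A rule receives the answer text, its <pre> blocks and its inline <code>
# blocks, and returns the text to show, or None to let the next rule decide.
Rule = Callable[[str, List[str], List[str]], Optional[str]]

RULES: List[Rule] = []


def rule(func):
    """Register a rule; rules are tried in the order they are registered."""
    RULES.append(func)
    return func


@rule
def single_pre(answer_text, pre_blocks, inline_codes):
    # Example rule: a lone <pre> block is probably the answer.
    if len(pre_blocks) == 1:
        return pre_blocks[0]
    return None


def choose_output(answer_text, pre_blocks, inline_codes, rules=None):
    """Return the first rule result, falling back to the whole answer."""
    for candidate in (RULES if rules is None else rules):
        result = candidate(answer_text, pre_blocks, inline_codes)
        if result is not None:
            return result
    return answer_text
```

New rules can then be added or reordered without touching choose_output, and passing an explicit rules list makes each rule easy to test in isolation.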

ykskks commented 3 years ago

I tried some queries myself and found that only showing the first pre is indeed not a great idea.

The rule-based approaches that you described might work, but it is hard to come up with the rules and to actually evaluate them. Ideally we could collect data and use ML (I am thinking of simple models like Logistic Regression with text and possibly hand-crafted features) to predict which one to return.

However, it needs labeling work, which most people don't want to do. So for now, below is my idea to somewhat mitigate the problem.

only one pre exists → return pre
2 or more pre exist → return all pre with preceding sentence or paragraph
only one code exists → sentence or paragraph with code
2 or more code exist → all sentences or paragraphs with code

This way, we can cut out unnecessary parts and still return meaningful content.
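The four rules above could be sketched like this, assuming the answer has already been split into ordered segments. The Segment type and select_segments function are hypothetical. Two points are assumptions not stated in the proposal: pre blocks take priority when both pre and inline code are present, and the whole answer is returned when neither is found.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    kind: str                      # "pre" for a block, "text" for a paragraph
    text: str
    has_inline_code: bool = False  # only meaningful for "text" segments


def select_segments(segments: List[Segment]) -> str:
    pre_indexes = [i for i, s in enumerate(segments) if s.kind == "pre"]

    # Only one pre exists: return it alone.
    if len(pre_indexes) == 1:
        return segments[pre_indexes[0]].text

    # Two or more pre exist: return each with its preceding paragraph, if any.
    if len(pre_indexes) >= 2:
        parts = []
        for i in pre_indexes:
            if i > 0 and segments[i - 1].kind == "text":
                parts.append(segments[i - 1].text)
            parts.append(segments[i].text)
        return "\n\n".join(parts)

    # No pre: return every paragraph that contains inline code
    # (this covers both the "only one" and the "2 or more" code rules).
    with_code = [s.text for s in segments if s.kind == "text" and s.has_inline_code]
    if with_code:
        return "\n\n".join(with_code)

    # Assumed fallback: nothing to extract, so return the whole answer.
    return "\n\n".join(s.text for s in segments)
```

For an answer with the paragraphs "First way:", a pre block, "Unrelated.", "Shorter:" and a second pre block, this returns both blocks with their introductions and drops "Unrelated.".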

gleitz / howdoi

Determining which code snippet to show, if any #362