glam-lab / degender-the-web

A Chrome extension that replaces all gendered pronouns with "they/them/their."
GNU Affero General Public License v3.0
6 stars 3 forks source link

"Gender" should not act as a stopword if it appears only in text not visible on the screen #49

Closed ProfJanetDavis closed 5 years ago

ProfJanetDavis commented 5 years ago

Since the href of a link is not visible on the screen, this is a confusing behavior.

ProfJanetDavis commented 5 years ago

Or in any other hidden component.

ProfJanetDavis commented 5 years ago

Is there a way to test if a text node is visible on screen? See https://stackoverflow.com/questions/19669786/check-if-element-is-visible-in-dom

beszel commented 5 years ago

@ProfJanetDavis Currently the extension just checks a big chunk of text HTML for stopwords, right? I'd like feedback on my plan so far, to make sure I'm not missing an obvious solution with lower overhead.

  1. Get every element in the body.
  2. For starters, check for stopwords only in the content between tags, like this: <p class='ignore-this'>Just this content here.</p> That should fix the link specific issue and some others.
  3. Then I'll change it to check the visibility of the element itself using offsetParent or something similar.
ProfJanetDavis commented 5 years ago

Yes, the extension currently checks the entire document body for stopwords: https://github.com/glam-lab/degender-the-web/blob/6af6f9826c36e6677e8515af94ddad058a98f74b/src/main.js#L85

ProfJanetDavis commented 5 years ago

I just wrote a very long screed which got deleted when I accidentally closed the tab. I'll take the opportunity to write something more concise instead.

The algorithm you propose above raises concerns for me. "Every element in the body" includes elements nested inside of other elements. That may be a problem for both efficiency and correctness. Do you see why?

One alternative is to follow a strategy similar to that used in replaceWordsInBody():

  1. Use textNodesUnder() to get all the text nodes in the body.
  2. Write a recursive function isVisible(), similar to isEditable(), to check whether each text node is visible. Call this function either before or after checking the text node's contents. (Checking a parent node's type and attributes repeatedly should be faster than searching a child node's text repeatedly.)

I'm pretty confident you could make this work.

Another possibility might be to use TreeWalker directly. I didn't do that in replaceWordsInBody() because weird things happen if you change the document tree while walking it. Hence the strategy above, where textNodesUnder() uses a TreeWalker to collect all the text nodes in an array before making any changes to the document. But for the problem posed in this issue, using TreeWalker directly might be an elegant solution. I haven't investigated in depth or thought it through carefully - you'd have to do that.

What do you think?

ProfJanetDavis commented 5 years ago

One further thought: Don't change the line I quoted above (or the mentionsGender() function). That's meant to be a very fast check on whether there's anything in the body worth examining more carefully. Any code that walks the document tree should be inside that conditional, similar to the final check for gender pronouns.

I fear that main() is getting far too complicated. If you see opportunities to refactor, by all means make a proposal.

beszel commented 5 years ago

This all looks good. I'm working on a solution that uses textNodesUnder, and I've written isVisible and a new function visiblyMentionsGender. Do you know of a page that produces the issue? I'd like to use it for testing.

ProfJanetDavis commented 5 years ago

The page that surfaced the issue was test.html, when it linked to some articles it no longer links to. I changed it so it would be useful for the purpose I made it for. I have seen the issue arise since then, but I'm afraid I didn't note the URLs. (I have been trying to get better at that.)

Here's a contrived HTML document that illustrates as many concerns as I can think of. Please forgive my technical and conceptual sloppiness.

<!DOCTYPE html>
<html>
  <body>
    <div style="display:none">
         The previously selected gender is <span id="storedGender">unknown</span>.
    </div>

    <script>
        var gender = document.getCookie("gender"); 
        document.getElementById("storedGender").innerHTML = gender;
    </script>

    <p>
        <a href="https://www.independent.co.uk/life-style/women/women-workplace-gender-pay-gap-maternity-leave-feminism-a8979581.html"
        >More women go to university than men, so why is there still a pay gap?</a>
    </p>

    <form>    
        <p>
            <!-- This should really be "gender", but the client insisted on "sex" -->
            <label>Sex:</label>
            <select id="genderOption">
                <option value="female">Female</option>
                <option value="male">Male</option>
                <option value="other">Other</option>
                <option value="unspecified">Prefer not to say</option>   
            </select>
         </p>
         <p>
            <label>Your comments:</label><br/>
            <textarea id="comments" cols="80" rows="10"></textarea>
         </p>
         <input type="submit" value="Submit">
     </form>
  </body>
</html>

Feel free to add any other causes of "non-visibility" you can think of. I'll let you know if I disagree. For this purpose, I would consider any text that can eventually be seen by scrolling to be user-visible.

I wish I already had those automated end-to-end tests in place!

ProfJanetDavis commented 5 years ago

I actually just noticed that the third answer includes someone else's definition of isVisible which you might want to compare to yours: https://stackoverflow.com/questions/19669786/check-if-element-is-visible-in-dom

beszel commented 5 years ago

I've implemented a solution that works for the HTML above using only offsetParent. The extension still works normally for other cases. As the Stack Overflow answer suggests, this is faster but less complete than using getComputedStyle. I'm inclined to stick with this low-overhead solution until we can find a case where it fails, but I'm sure I could do it another way without too much trouble.

ProfJanetDavis commented 5 years ago

Okay, sounds good. I looked for a pull request and I don't see one yet. Do you want any help with testing?

beszel commented 5 years ago

I'm going to try testing on my own, then come back with questions once I know what I don't know.

ProfJanetDavis commented 5 years ago

Excellent!

-- Janet Davis Sent from my "smart" phone, in haste, with auto-(in)correct

On Sep 4, 2019, at 8:14 AM, Ian Hawkins notifications@github.com wrote:

I'm going to try testing on my own, then come back with questions once I know what I don't know.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.