L2H cold start tag predictions

cooglejj commented 6 years ago

Thanks for the detailed feedback! I've responded to everything you mentioned and made changes accordingly. Please let me know if there's anything more I could do.

cooglejj commented 6 years ago

Thanks again for the detailed feedback. I've worked around needing temp files and resolved the other points as appropriate. If there's more I could do, please let me know.

Zerthick commented 6 years ago

We should only be including functionality that the plugin actually needs while running in the ide, anything that does not need to be executed on the user's machine should be bundled in separate scripts and put into somewhere like /src/setup/java instead of /src/main/java

cooglejj commented 6 years ago

Thanks again, I've removed the main method and done some other minor changes.

Something I think I should clarify: CustomL2H.java and the code in the TextDataLoader class (private class in L2HPredictor.java) are both directly copied from segan.jar. In the case of CustomL2H.java, it is a copy of segan's L2H.java, but with some code changes that work around some issues in the original L2H.java (also, the outputSingleModelClassificationToString method is copied with minor modifications from elsewhere in segan.jar). The code in TextDataLoader is copied from one of segan's text loading classes, but modified so it reads from an InputStream instead of a File (solving the temporary file issue we talked about).

With that in mind, I think some of these issues will make more sense. I'm using the InputStreams because it allows me to directly use segan's text loading code (in TextDataLoader) without writing a custom, specialized version for this specific use case. Given that, while I agree it is not the most efficient solution, I'm not convinced the relatively small loss in efficiency by using streams justifies creating a custom version tailored for this case. Notably, the current version works as-is if we need to work with predictions in files in the future, or even predictions from other sources such as an external website. Creating a specialized version would either remove that possibility entirely or result in further code duplication and complexity.

I think this code origin also explains why CustomL2H has methods that seemingly have nothing to do with actually running the model: I'm not trying to unnecessarily include code for building the model in the plugin. Rather, I'm taking an existing, very general class for L2H models, and trying to make the minimal necessary modifications for it to make correct predictions for the plugin. I've removed the train method, but I don't think there's much further refactoring that would add notable benefit. Given the copied nature of this class, the only reason I see for further refactoring is to reduce bloat from the jar. However, at this point the compiled class is only adding 51kb to the final jar. If bloat is a concern, I believe there are likely to be bigger and easier wins than stripping this class.

Of course, I could be overlooking something in these thoughts. Please let me know if you think there's more to it than what I've considered.

Zerthick commented 6 years ago

Ah, this makes more sense then. Is there are reason you went with copying and modifying the code in the segan jar instead of simply extending the existing segan classes and overriding the methods that you need to modify? I feel like that would make more sense from a sustainability point of view.

cooglejj commented 6 years ago

Yes, I originally tried extending segan classes (You can see a minor remanent of that in the old version of TextDataLoader, which used to extend from a segan class). In the case of TextDataLoader, I needed to change the original method to use streams instead of files, so there's little that could be done in that case.

In the case of CustomL2H, the methods I needed to modify needed to be copied to begin with (or else the necessary modifications couldn't be performed), but they also directly or indirectly called nearly every private method in the original L2H class, meaning those methods would also need to be copied. Ultimately, it meant very little total code could be saved by extending, and I also didn't like the bug potential where overridden methods would call the copied version of the private methods, and non-overriden methods would call the original, uncopied version of the private methods. Therefore, I chose to copy both because of simplicity and because it was likely that a newer version of segan would eventually make the same fix I was doing, at which point the CustomL2H could be removed and replaced with the regular L2H class anyways.

damevski commented 6 years ago

@cooglejj segan's code hasn't been modified for a couple of years ago, so, i think, most likely there won't be a better version. I think one of these two approaches for how to include L2H in StackInTheFlow makes sense to me, and please feel free to push back or for @Zerthick to chip in: 1) we modify Segan's code (as I think you have) and create a new customized jar that we pack with StackInTheFlow -- we maintain the source for the changed segan in some other repo; or 2) we strip only the stuff we absolutely need from segan, making our own de facto implementation of L2H (while attributing the code to segan) and have no dependency on his jar. Otherwise, as it looks to me, keeping more of segan's code than we strictly need, as source, feels like we are including and plan on maintaining some zombie part of segan's code.

Zerthick commented 6 years ago

I agree with @damevski I think either option would be acceptable, though I'm a fan of option 2 since it means less overall bloat.

cooglejj commented 6 years ago

I've investigated both approaches to some degree and come to some conclusions:

Approach 2 is undesirable for two reasons: Firstly, my efforts so far suggest it would take numerous hours to completely implement. Secondly, the stripping process has potential to introduce new bugs nearly anywhere in the segan code which would make them difficult to find and resolve.

Approach 1 is plausible, but it doesn't make as much sense under scrutiny. Notably, segan's L2H class is never used within the segan codebase. In other words, we can get the same effect by simply deleting the L2H class from segan.jar and using the CustomL2H class in it's place. At that point, the "changed segan in some other repo" would simplify to just the CustomL2H class by itself. Which is fine, but that then raises the question does it really make sense to have an entire separate repo devoted essentially to maintaining a single class?

In light of this, I think we should clarify what problem(s), specifically, we're trying to solve with these changes. I'm sure we all have a general sense, but I believe getting more specific will help us decide the best approach to take from this point.

damevski commented 6 years ago

If Approach 2 isn't feasible, Approach 1 will have to do... Sounds like that approach consists of a packaging segan.jar and CustomL2H.java with StackInTheFlow. I do agree that my suggestion of a separate repo is pretty heavy handed. GitHub submodules could also be an option here, but I don't personally see the advantage of using them. Maybe one of you guys is a stronger advocate and can articulate how they would help us in the long run.

If we go with the simplest approach of packaging these 2 files with the tool, I would consider renaming CustomL2H to SeganL2HAdapter (or something similar) and placing both the class and the jar in a clearly marked place in the StackInTheFlow repo. This also assumes that SeganL2HAdapter would be simple and clean, and not include any unnecessary bloat.

cooglejj commented 5 years ago

CustomL2H is now packaged inside the segan jar, and extraneous util classes are removed. Unless there's a further issue I've overlooked, I think this should be ready to use.

vcu-swim-lab / stack-intheflow

L2H cold start tag predictions #143