Initializing with tables in dedicated directory fails in 4.1.0

alexander233 commented 5 years ago

We have been using the 3.0.0 version for a few years and now tried moving to latest version 4.1.0. Our code no longer works because our tables are stored in a dedicated directory (not the jar file) which we simply pass to the Translator together with the table names during Translator instantiation. But now library initialization fails because a default table resolver is created which tries to read resources from the jar file and then throws an exception ("directory does not exist"). As a temporary fix, we deactivated the creation of a new TableResolver in Louis.getLibrary():250 and uncommented the registration of the table resolver in Louis.getLibrary():312-316 (if (!tableResolverIsRegistered) { ... } Then everything worked like a charm!

I agree that automatic table extraction from the associated jar file is very useful but it would be good to have some mechanism to avoid the problem for cases where tables are stored in the file system (which is more efficient for normal operation). One alternative might be to add a different table resolver which would be initialized to the directory or one might add a flag such as autoExtractTablesFromJar which is set to true by default but could be set to false in cases where the files are already in the file system.

bertfrees commented 5 years ago

Hi! Thanks for the report.

The reason I didn't face this issue before is probably because the way I use liblouis-java myself is always with a custom table resolver.

However it should work I think. The default table resolver first looks for tables inside the JAR, but if it can't find the table it should then fall back to the file system.

I'll double-check that it works and will then come back to you with an example.

bertfrees commented 5 years ago

Apparently there was a bug. Could you try the latest version from master?

alexander233 commented 5 years ago

Thank you for the quick update. I still get an exception. The problem is caused by the initializer of the anonymous table resolver which calls Louis.listResources() in line 256 at a time when we have not yet defined any location of where the tables reside in the file system. This results in an error "directory does not exist" which is thrown in Louis:469

alexander233 commented 5 years ago

Maybe the anonymous TableResolver should not search for resources during initialization (which is expensive and may not be needed for standard translation use cases) but call listresources only when TableResolver receives the first request for a table resolution.

bertfrees commented 5 years ago

Maybe the anonymous TableResolver should not search for resources during initialization

Yes, that makes sense. This code is only needed for the discovery of tables (the Translation.find() method), not the resolving. So it would solve your problem, but that doesn't change the fact that there is something wrong with the code. Strange, because this has been tested on Windows.

bertfrees commented 5 years ago

This code is only needed for the discovery of tables, not the resolving.

Sorry, this is not correct. It is needed for resolving too, which means that as soon as you want to load a table, you would get the "directory does not exist" error.

alexander233 commented 5 years ago

With our modifications (which just keeps the TableResolver at null and prevents calling listresources during initialization of the library), Translator works correctly, loads tables correctly from the list of tables which we supply in the constructor (for example Translator trans = new Translator("d:\projects\braille\assets\liblouis\unicode.dis,de-de-comp8.ctb");

de-de-comp8.ctb includes a reference to braille-patterns.cti which probably needs to be resolved but this does not seem to lead to a problem in our case with our TableResolver being null.

bertfrees commented 5 years ago

@alexander233 Would it be possible for you to try to debug this? We are using liblouis-java on Windows too, without a custom table resolver, but don't get your exception. So it's going to be hard for me to find the problem.

bertfrees commented 5 years ago

@alexander233 Does this have to do with https://github.com/liblouis/liblouis-java/issues/16 maybe?

alexander233 commented 5 years ago

@bertfrees Sure, I will take a shot at debugging this issue. I don't think that #16 is the same issue but I will check it out...

alexander233 commented 5 years ago

I have just forked the repository and compiled on Eclipse. On my machine the unit tests fail immediately in Louis:448. The code determines in Line 445 that my classes are in the target directory of my Eclipse Workspace ("D...\liblouis-java\bin\", not in a jar file, and then appends "org/liblouis/resource-files/tables" to that directory (which is not part of the libloius-java repository and which I have not yet created). Predictably in Louis:448 the exception follows which finds that the hard-coded tables subdirectory does not exist. Of course, I could now create the directory in the specified path, and the code would work, but currently there is no way to tell Louis that the tables directory is somewhere else.

To replicate this behavior on your machine, just rename your tables subdirectory to something other than ../org/liblouis/resource-files/tables

Please let me know if you consider this to be part of the problem or if a specific directory path for the liblouis tables should be enforced.

bertfrees commented 5 years ago

I'm not sure I understand the problem. The tests are not supposed to be run with Eclipse. There is also an integration test to make sure it still works when packaged in a JAR. I would say this is the test that really matters. If you can make that one fail, there is really a problem.

Anyway, I have pushed some commits. I hope they fix your problem. The first commit includes your change from https://github.com/liblouis/liblouis-java/pull/17/commits/9bf03aaaca6842ba13758c6ea11753b81d71b8b3 and some additional changes. In the second commit I split the JAR into two versions, a standalone one, with the tables and binaries embedded, and one without tables and binaries (how it was before 4.0.0). I suggest you try both.

alexander233 commented 5 years ago

From my perspective the root problem is: how does one specify the path to the liblouis tables (where the main context is translation). This is only relevant for situations where for whatever reasons the default tables are not in the jar (in our case: we deploy the jar as part of an application and our installer places the liblouis tables in a particular directory). That the tests failed so early (I did not expect them to pass), was only an indication of this root problem.

Your latest version overcomes the problem by just catching and suppressing the exception that is produced when the tables are not found in the default place where they are expected. This works!

But from an architectural standpoint, such an approach looks like like a last resort and may be an indication that some structural changes might be useful. Conceptually, it is great and very convenient if a user just has to put the jar in place and liblouis just works. But this comes at a cost: tables are extracted to the file system every time the application is started and placed into temporary directories where they are deleted when the virtual machine exits. Furthermore, if I remember correctly, all tables are indexed every time. Extracting files and indexing are significant overhead, especially if one usually only needs a few tables.

Therefore there should be a way to avoid this overhead - even users the of the standalone jar should be able to switch to a more efficient use of liblouis by just extracting the tables to a permanent directory and then specifying that directory when starting liblouis.

The easiest way of doing that would be by adding a static method setTablesDir(File tablesDir) - which - like setLibraryPath() - should be called before Translator methods are called. This would just add a table resolver set to that specific directory. I am not sure however, that is a good idea to index all files in that directory always. This is something that one should be able to turn off (the indexing strategy might be set as a property of the table resolver - and it might be the task of the table resolver to initiate an indexing strategy (index all tables, index on demand etc.) instead of always performing an indexing of all tables in Louis.java (which seems to be needed and efficient (compared to ad-hoc indexing performed by liblouis dll) only for rare usecases).

bertfrees commented 5 years ago

From my perspective the root problem is: how does one specify the path to the liblouis tables (where the main context is translation). This is only relevant for situations where for whatever reasons the default tables are not in the jar (in our case: we deploy the jar as part of an application and our installer places the liblouis tables in a particular directory). That the tests failed so early (I did not expect them to pass), was only an indication of this root problem.

To be honest I don't see what the failing test indicates, other than that there is something wrong with how the test is executed, namely for some reason the tables are not in the place where they were expected to be. This should never happen in a real situation.

The path to the tables directory can not be set via the "EmbeddedTableResolver" at the moment, and I don't immediately see the need to support this. The only purpose of this class is to find tables inside the JAR. You can specify the location of the tables by setting your own table resolver, or for basic usage, by using the default table resolver from the core lib, which supports the LOUIS_TABLEPATH environment variable. The latter can be achieved by not using the "standalone" version and not setting a custom resolver. (The fact that this was not possible before was a design mistake, but is fixed now.)

Your latest version overcomes the problem by just catching and suppressing the exception that is produced when the tables are not found in the default place where they are expected. This works!

I admit it's not the best solution, but it's a quick and easy one and it does exactly what I want, namely catching the case where there are no tables in the JAR. Of course by ignoring all exceptions it's harder to find bugs, so at some point I should probably change it. (My first idea was to remove the EmbeddedTableResolver class from the JAR, but that turned out to be hard to achieve.) But while it's not the most robust solution, I don't share your opinion that there is a structural problem.

But from an architectural standpoint, such an approach looks like like a last resort and may be an indication that some structural changes might be useful. Conceptually, it is great and very convenient if a user just has to put the jar in place and liblouis just works. But this comes at a cost: tables are extracted to the file system every time the application is started and placed into temporary directories where they are deleted when the virtual machine exits. Furthermore, if I remember correctly, all tables are indexed every time. Extracting files and indexing are significant overhead, especially if one usually only needs a few tables.

I acknowledge that there are cases where you don't want to use the embedded tables. This is what I do myself, and how liblouis-java has always been used prior to version 4.0.0. But like we said before there are ways to avoid the embedded tables, the most obvious being to just not use the standalone version.

I see the indexing and the discovery function as an essential part of Liblouis. If you choose not to use it you should have a damn good reason. But anyway, if I'm not mistaken, the extracting and indexing of the tables only happens when you use the "find" method the first time. So it's easy to avoid.

The indexing could in theory be done more efficiently than is the case now. For one thing, the extracting is not necessary. So that is a something we could look at.

Therefore there should be a way to avoid this overhead - even users the of the standalone jar should be able to switch to a more efficient use of liblouis by just extracting the tables to a permanent directory and then specifying that directory when starting liblouis.

The easiest way of doing that would be by adding a static method setTablesDir(File tablesDir) - which - like setLibraryPath() - should be called before Translator methods are called. This would just add a table resolver set to that specific directory. I am not sure however, that is a good idea to index all files in that directory always. This is something that one should be able to turn off (the indexing strategy might be set as a property of the table resolver - and it might be the task of the table resolver to initiate an indexing strategy (index all tables, index on demand etc.) instead of always performing an indexing of all tables in Louis.java (which seems to be needed and efficient (compared to ad-hoc indexing performed by liblouis dll) only for rare usecases).

I can only see a marginal benefit of doing this, because you can already use the default resolver of the core lib. OK, a Java method might be slightly more convenient than setting an environment variable, but then again there will be some people who would like the resolver to behave slightly different, and then you'll need more settings. I think the most straightforward is to just let people implement their own resolver. This was the reason for introducing the pluggable table resolver feature in the first place. And it really isn't difficult to do.

Regarding the idea of using the standalone JAR for extracting the tables to a permanent location: I hadn't thought of this use case before. For this case I guess I could provide some way to turn off the EmbeddedTableResolver.

Regarding your comment about the indexing, see what I wrote above. It is already "on demand". And it's completely unobtrusive: if you are not interested in the "find" functionality, you won't notice it.

alexander233 commented 5 years ago

I am willing to help. I will write a simple directory table resolver hopefully later today that just takes a directory as its input. If this is part of the jar, then it will be easy to initialize even the standalone jar file with another, permanent tables directory...

bertfrees commented 5 years ago

OK.

alexander233 commented 5 years ago

It is interesting to get a closer look at the inner workings of liblouis! I implemented a simple directory table resolver (source code is available in my fork at https://github.com/alexander233/liblouis-java/blob/master/src/main/java/org/liblouis/DirectoryTableResolver.java Tables seem to be resolved but I am getting strange jna exceptions after several nested calls to the DirectoryTableResolver (and table lists given to the Translator separated by comma can not yet be processed properly by this DirectoryTableResolver).

This looks to me that I am making simple things more complicated. Liblouis correctly has been resolving tables in a single directory without problems and this DirectoryTableResolver is now converting files to uri and url and back several times, even this simple directory table resolver causes more processing (most not relevant for the directory table case) to an anonymous Lou_TableResolver etc....

I am not sure that this kind of code would add any value to liblouis; you are probably right that just going the exception way by working with the 'normal' version of the jar and simply passing the directory to the Translator upon instantiation is the most efficient way to go for me... (I still think, though, that there should be a way so that users can specify a permanent directory for the tables even if they have the standalone jar so that they could avoid the extraction of the table files to temporary files every time that otherwise occurs every time that liblouis is run...).

This was interesting, I learned some things about liblouis but you probably agree that the DirectTableResolver is not really a useful direction to proceed with?

alexander233 commented 5 years ago

I am just finding that the jna exceptions don't seem to be related to the DirectTableResolver... They also occur when not using it .. I will look at this some more...

bertfrees commented 5 years ago

Have a look at this code which is a very simple resolver that lists table files in a specific directory.

https://github.com/liblouis/liblouis-java/blob/18dde01ecd375a7be2b188474b91e87721d9be8a/src/test/java/org/liblouis/FindTranslatorTest.java#L46-L70

It works a bit differently than your implementation, in that the list() function returns the full paths of the files, and it relies on the fact that the Translator.find() function will send these full paths to the resolver. Anyway, if you rewrite this function slightly it can be made to behave exactly like your implementation. And if you think away the part that handles the base argument, the whole code is only a hand full of lines. It does not support the "comma-separated" format, but actually I would rather get rid of that thing completely.

As to whether such a class would be a useful addition, I stay with what I said before: it could be useful for some people, but my concern is that it will not satisfy everyone's needs and at the same time it is so easy to write yourself. Creating a fully configurable table resolver that satisfies everyone's needs seems unneeded as it would defeat the purpose of the setTableResolver function.

alexander233 commented 5 years ago

I have been examining the code and would like to propose simplifying the TableResolver architecture:

The starting point would be to change the resolve(String,URL) method of TableResolver to resolve(String) and pass the baseUrl (or base file path) to the table resolver in its constructor. Instead of allowing every call to the table resolver to provide a different 'base' , we would require that one base is associated with one table resolver (this works well for the standard use cases, such as a Jar file, a single directory or a base URL and there are simple ways for extending this to support multiple bases as a special case, see further below...).

This would make code much easier. We would have a JarTableResolver which would receive the path to the jar file in its constructor, a directory table resolver (which would receive a path to a single directory in its constructor) and could also have a UrlTableResolver which would receive a base Url in its constructor against which to interprete subsequent requests to find tables. If you have a need to combine files from multiple directories, there could be a MultiDirectoryTableResolver, too. It would also be very simple to add an AggregateTableResolver (i.e. a table resolver that just holds a list of other table resolvers, to which he then passes on requests for tables until a match is found).

It would be straightforward to refactor the code in this way (it would not mean much change of code, mostly moving code into separate classes). Code clarity would be improved, code that selects which kind of table resolver is needed would be clearly differentiated from code that resolves tables and it would no longer be necessary to translate forward and backward between URLs (especially given that lou_tableResolver has no support for URLs anyway). Also URL-specific issues (such as generated table names) and jar-specific issues such as tablesStoredToFile would no longer appear in the main Louis.java class but be move to the associated UrlTableResolver or JarTableResolver. This would increase maintainability and readability of the code.

If you think this refactoring might be useful, I can take a shot at it and you can then decide later whether to incorporate it...

bertfrees commented 5 years ago

I'm not against adding more TableResolver implementations to the library per se. Although, as I said a few times before, I'm a bit skeptical about whether the usefulness outweighs the effort. On the other hand I'm all for making the code easier to read and maintain, so if a bit of refactoring can improve the code quality and at the same time provide some more convenience to the user, I guess there's nothing wrong with that. However as I'll explain next, I'm quite determined to keep the resolver API like it is now, so I don't know how much will be left of your refactoring plans with that knowledge.

The signature of the TableResolver was properly thought through and I think there is no need to change it. The reason for taking a base argument is to enable the possibility to resolve table includes to any table from any table. This is often more flexibility than is needed, but I don't want to limit it in any way. A resolver is free to ignore the argument anyway, so it doesn't add any complexity for the user. Only (a bit) for me. The reason for returning a URL is because it is convenient for the user that he doesn't need to store tables into files when they aren't already stored in files. liblouis-java will do that for him. Initially the idea was to return a InputStream, but I needed something to be able to identify a InputStream in order to resolve other tables against it, so that's how I ended up with a URL because basically a URL is a InputStream with a name. (Note that you can wrap any InputStream in a URL with a custom protocol.) For now the core C library is still file based, but the idea is that eventually it will support reading tables from streams. What's great is that users of the Java bindings will normally not even notice when this change happens.

alexander233 commented 5 years ago

Thank you for your reply and sorry for my late answer (due to vacation). I understand what you are saying and agree that it probably does not make much sense of making these changes. Again, thank you for this great library and you can close this issue.

liblouis / liblouis-java

Initializing with tables in dedicated directory fails in 4.1.0 #15