tajmone / PBCodeArcProto

PB CodeArchiv Rebirth Indexer Prototype

Logger Module: String or number as logger identifier? #17


SicroAtGit commented 6 years ago

Currently the logger module uses a string identifier for the loggers. This variant saves defining constants and is possibly more flexible, but it is more error-prone (typos).

I can easily modify the module to use the same number-identifier system that PureBasic uses internally. #PB_Any would then also be possible.

I find it difficult to decide between the two variants.

Which variant do you think is better?

tajmone commented 6 years ago

I'm looking at the source of _Sicro_CodeSharing/Logger.pbi, but don't have a link for the other version you mentioned.

I'm also facing similar doubts whenever I move the current code to a module, and I often end up stuck for days before I can decide how to approach it. The difficult part is trying to imagine how different tools will be using the code.

Also, experience taught me that PB modules can easily bring up issues if one is not careful in design choices — but modules are great to use.

From what I see in Logger.pbi, logger names are currently defined at will by the caller:

AddLog(LoggerName$, Text$, Level = 0)

... where LoggerName$ is then mapped. Is this what you are thinking of changing, i.e., instead of using names you assign each module an auto-generated ID?

I can easily modify the module to use the same number-identifier system that PureBasic uses internally. #PB_Any would then also be possible.

If the generated number is held by a module variable, there shouldn't be any conflicts. But then the other modules will need to store the returned ID in a variable to reference the logger, and that variable will have to be passed around between modules. So, in one case it's an integer number, in the other it's a string.
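To make the two options concrete, here is a minimal sketch of the two calling styles (CreateLogger() is a hypothetical name; the string variant follows the AddLog() signature quoted above, the numeric variant assumes an integer first parameter instead):

; string identifier: no setup needed, but a typo in the name
; silently targets (or creates) the wrong logger:
AddLog("Scanner", "scanning folder ...")

; numeric identifier: the auto-generated ID must be stored in a
; variable and passed around between modules:
ScannerLogger = CreateLogger(#PB_Any)
AddLog(ScannerLogger, "scanning folder ...")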

I'm trying to understand better how the logger is intended to be used. Are there going to be a number of preset loggers, or can each tool create any number of loggers it wishes?

It's tricky. I'm facing the same issue with the Errors module, and still can't work out whether all possible errors should be defined in the Errors module, or whether each module should be able to register the errors it could produce — the latter seems more logical, since each module will need to handle the errors it can produce, but some errors are common to different modules/tools. The former has the advantage of offering a centralized reference for errors, and any changes will affect all tools; although this is good in terms of API standards, it might introduce update problems.

As a guideline, I'm trying to put in __mod_G__ all that is commonly shared by all modules and tools; and deciding if Error types belong in one or the other is not always a clear-cut decision.

I think that loggers are similar to errors in this respect, for they will be shared by all tools and modules.

Could you please illustrate the possible uses of loggers you have in mind? (i.e., how they should be usable once fully finished).

tajmone commented 6 years ago

As a practical example: currently, in mod_CodeArchiv.pbi, the Debug output of the ScanFolder() procedure will need to be redirected to the logger module. It debugs an ASCII tree of the Archiv folders and files being scanned, and provides details of which folders are ignored or counted as categories, and so forth for resource files.

Usually one doesn't need this info, but in case of problems it would be useful to access it, in order to check whether some folders are being skipped or included by mistake. So this info should somehow be redirected to the logger module, which should then handle it according to the settings of the invoking tool (display it, discard it, whatever).

The debug text could be passed to the logger in different ways:

- line by line, in real time, as the scan generates it;
- accumulated into a single string and handed over once the scan is complete.

Clearly, the first solution will provide real-time updates about the scan process, while the latter will create a time gap and then display a very long text all at once.

The issue here is that this module doesn't have a clue how the main tool is going to handle this debug data. So there should either be a way to specify which type of debug text is being sent (something like a Debug Level value), or a way to rely on some logger reference passed in by the main tool, and just pass on the text without further specifications.

The tricky part here is to have an open approach that could satisfy any potential tool, whether it uses the Debug window, the console STDOUT, or a GUI gadget (string, HTML, whatever).

I suspect that some predefined loggers should always be made available from the logger module, and maybe extra loggers could be created on demand.

In some cases, loggers are intended to store data which might be queried later on (e.g., a GUI app might offer a button to view a log of a previous operation); in other cases they should redirect the received text directly to the user (Debug window, or console output).

So, probably there are input and output loggers (always dealing with strings), and it's imaginable that these could be pipe-chained together, so that text sent to a given input logger is automatically piped to another output logger.

The whole idea is that tools should offer control over which info is shown and which discarded, which is stored for later use and which is consumed immediately. Some of the log strings could easily become very long and end up consuming memory, so we might have to consider whether it's worth storing some of them or whether it's best to just recreate them on demand (when possible).

For example, mod_CodeArchiv.pbi will provide a procedure to create a text string representing the ASCII Archiv tree; this will be generated on demand, since it doesn't need to rescan the whole project but just to convert the list of categories to an ASCII tree.

But the above-mentioned log of the folder scan, on the other hand, should be preserved (if settings require so) so that a tool can show it on demand — because re-scanning the whole project might reset some other features which follow this step. And because the string is really long, it might be worth leaving it to the logger module to handle (e.g., the logger might decide to store it in a temporary file to preserve memory).

It's difficult to predict every possible scenario, but the whole idea of a logger module is that it should provide a simple API and handle things behind the scenes. So adding file caching to the logger shouldn't affect the code that uses it.

The difficult task at present is to imagine the different uses of the logger.

tajmone commented 6 years ago

I've now started to look into the updated logger module ... I can see the direction it's going in, and it looks really great and flexible.

tajmone commented 6 years ago

After a closer look at the new logger module, and back to the original question:

Currently the logger module uses a string identifier for the loggers. This variant saves defining constants and is possibly more flexible, but it is more error-prone (typos).

I can easily modify the module to use the same number-identifier system that PureBasic uses internally. #PB_Any would then also be possible.

I find it difficult to decide between the two variants.

I wouldn't know what to decide either. The string identifier might have as a pro that it makes code more intuitive to read (but adds some penalty from a memory point of view); but then, the incremental numeric identifier will have to be stored in a custom var, so the var would get an intuitive name too.

Unless manipulation of logger ID strings is a need, using either a #PB_Any-like approach, or even enumerations, should be good. I personally tend to prefer (in general) enumerations over automatically assigned IDs, but in some contexts the latter might be better. Does the system you have in mind allow both usages? (i.e., getting an auto-generated ID number, or specifying one manually)
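For illustration, the two numeric usages could look something like this (Logger::Create() is a hypothetical procedure name, not the module's actual API):

; manually specified IDs via an enumeration: readable constants,
; and typos are caught by the compiler:
Enumeration Loggers
  #Logger_Main
  #Logger_Scanner
EndEnumeration

Logger::Create(#Logger_Main)

; auto-generated ID, the #PB_Any way: the returned handle must be
; kept in a variable:
MainLogger = Logger::Create(#PB_Any)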

SicroAtGit commented 6 years ago

I have uploaded the second variant of the logger module (with number ID): mod_Logger_NumberID.pbi

As mentioned above, PureBasic internally uses the same kind of ID creation/management for windows/gadgets/images/etc.

Check out Fred's posts in this forum thread: https://www.purebasic.fr/english/viewtopic.php?f=13&t=70543. In the middle of page 2 and at the top of page 3 he also mentions problems with this approach:

There is a waste of resources for indexed numbers, as it uses an array internally. So if you use a #Sound number of 5000 for your first sound, 5000 slots (0-4999) will be unused.

But there is a chance of conflict if you use too-high indexed numbers, as #PB_Any returns the memory pointer of the list item (if you use an index number of 20 million and the memory allocator returns a pointer at 20 million, you can't tell who is who).

However, these two problems can be ignored. For example, no one will create millions of loggers in their code.
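For reference, the same trade-off can be seen with any native PB object, images for instance:

; indexed number: the caller picks the ID; PB keeps an internal
; array, so slots 0-4999 are allocated but left unused here:
CreateImage(5000, 16, 16)

; #PB_Any: PB returns an auto-generated ID (internally, a pointer
; to the object), which the caller has to store:
hImage = CreateImage(#PB_Any, 16, 16)
ResizeImage(hImage, 32, 32)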

I'll answer more later.

SicroAtGit commented 6 years ago

I have now modified the module mod_CodeArchiv so that it uses the module mod_Logger for debug outputs: mod_CodeArchiv_with_LoggerModule.

tajmone commented 6 years ago

Thank you very much, and I apologize for the delay in replying — but today my monitor fried and died (R.I.P.), so I had to purchase a new one and the whole day went by...

I really like the logger module you wrote: it's very flexible and can be passed around between main code and modules, so it leaves all options open to the tool creator on how to redirect output from modules.

I'm about to finish the categories iterator code and then integrate the CodeArchiv module into the Pages Generator (instead of the old system); after that, I'll integrate the logger into all the modules and main code.

String ID vs #PB_Any

I was also looking at the alternative logger you created using auto-generated IDs (the #PB_Any way). I was wondering, for the latter approach, why not use the PureLibrary built-in object functions to let PB internals deal with managing IDs?

I had once tested this approach by adapting the "Ticket" example of the PB SDK to PureBasic source, instead of C. To access the PB API built-in functions that handle auto-generated IDs I had used:

CompilerIf #PB_Compiler_OS = #PB_OS_Windows
  Import ""
CompilerElse
  ImportC ""
CompilerEndIf

  PB_Object_GetOrAllocateID(*Object, ID)
  PB_Object_GetObject(*Object, ID)
  PB_Object_FreeID(*Object, ID)

  PB_Object_Init(StructureSize, IncrementStep, *ObjectFreeFunction)
  PB_Object_CleanAll(*Object)

  PB_Object_Count(*Object)
  PB_Object_IsObject(*Object, DynamicOrArrayID)

  ; original C source of 'PB_Object_IsObject':
  ; M_PBFUNCTION(void *) PB_Object_IsObject(PB_Object *Object, integer DynamicOrArrayID);

EndImport

... and then used the API directly from PB source. I got the idea from looking at the Tailbyte source code, which uses this hack in different places. The test code I wrote works, but it's very unpolished and chaotic; if you need it, I can clean it up and share it.

This might have some advantage in terms of memory management, since this is the way PB natively handles dynamic objects cross-platform. Or maybe not; I'm not 100% sure. I had tested this in order to be able to use native PB code to access some API functions without having to write a full library in C, and to some extent it's achievable and has obvious advantages in terms of maintenance.

tajmone commented 6 years ago

Add Nul Device?

I was thinking that it might be worth adding #DeviceType_Nul to the devices, so the main tool can use it to discard certain outputs (e.g., debug info from certain modules, or of a certain level, etc.). With this fifth built-in device the logger should cover all possible uses, since #DeviceType_Callback can handle custom uses (of course, a Nul device could be created by using a do-nothing Procedure as callback, but I think its use will be common enough to justify its implementation).
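For comparison, this is roughly what the callback workaround would look like (only #DeviceType_Callback comes from the logger module; the AddDevice() name and the callback signature are my assumptions):

; a do-nothing callback acting as a Nul device:
Procedure NulDeviceCallback(Message$)
  ; intentionally discard the message
EndProcedure

; hypothetical registration call:
Logger::AddDevice(MyLogger, Logger::#DeviceType_Callback, @NulDeviceCallback())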

SicroAtGit commented 5 years ago

why not use the PureLibrary built-in object functions to let PB internals deal with managing IDs?

I haven't worked with the PB-internal commands yet. Therefore, I found it easier and faster to reimplement the functions in PB code than to first study the PB-internal functions.

I think the ID variant is better, because string IDs have no auto-completion and the PB compiler doesn't notice if you have a typo in a string ID somewhere in the code.

I therefore tend more towards this variant: mod_Logger_NumberID.pbi

Add Nul Device?

All devices of the current logger are processed. Adding a device that does nothing does not change the fact that the other devices are still processed.

To prevent the current logger from producing any output, simply do not add any device to it.

tajmone commented 5 years ago

You're right; mine was just an idea that crossed my mind, as I had seen it used in a few projects (and thought it might offer some advantages).

I too prefer numeric IDs to string identifiers.

As for the logger module, I have been struggling to work out how to register the various modules with the logger — this is actually one of the main blockers so far. The point is that different tools might use some modules and not others, and will also need to redirect some output to a certain logger (or discard it) depending on what the tool does.

I've struggled with this quite a lot, and tried some local tests, but couldn't yet figure out how to code the modules in a way that they can be configured by the "includer tool" so that their messages are handled according to the tool's needs. I.e., the only way to do that would be to keep storing the output text in local strings, and then send them to the logger, instead of using the Debug command; but my fear is that this will end up bloating the program's execution and consuming huge amounts of memory.

I was looking for a solution that wouldn't store huge strings for long periods of execution time. So far, I've been using a per-module system where output is either stored in a local string var or just printed using Debug. Probably, right now this is fine, and postponing the logger issue until the modules are more clearly defined might be a better idea (although it will mean rewriting many code parts again).

What I'm realizing is that, while the original intention was to create modules that were independent and which the tools would only include if strictly needed, some common features like caching and logging are making these modules interdependent, and chances are that ultimately every tool will have to use all modules.

The only workaround seems to be to use compiler directives to execute different code blocks depending on whether the logger module is being used or not — but this complicates a bit the whole design and maintenance of the modules.

As you can read in my MODULARIZATION NOTES, there are some aspects of module interoperability which are still obscure to me, and for which I haven't managed to come up with a clean solution. Quite frankly, the PureBasic language lacks features that could make life easier in this respect, and while modules were a great addition to PB, they aren't exactly as powerful as one would expect them to be. Also, PB preprocessing is not powerful enough to allow flexible design in these situations.

Quite frankly, I think that if the whole project were to be rewritten in another language (for example, Nim, which is also a procedural language), this whole task would be much simpler and cleaner, and we wouldn't have to reinvent the wheel when it comes to standards (we'd have ready-made libraries for YAML, markdown, caching, and other features). Probably, using another language with a rich ecosystem of libraries would take 1/4 of the time, and be easier to maintain.

Of course, this being a project dedicated to PB, it becomes almost a matter of principle to keep its tools in the language. But really, these design problems which I'm struggling with right now wouldn't even pose a problem in many other languages (usually, there's more than one way to skin a cat, but in PB it's not always easy to find a way to skin the cat in the first place).

SicroAtGit commented 5 years ago

I was looking for a solution that wouldn't store huge strings for long periods of execution time. So far, I've been using a per-module system where output is either stored in a local string var or just printed using Debug. Probably, right now this is fine, and postponing the logger issue until the modules are more clearly defined might be a better idea (although it will mean rewriting many code parts again).

I also think it is better to wait until the other modules are finished before integrating the logger module. Then we will have a better overview of which functions the logger module should have.

What I'm realizing is that, while the original intention was to create modules that were independent and which the tools would only include if strictly needed, some common features like caching and logging are making these modules interdependent, and chances are that ultimately every tool will have to use all modules.

The only workaround seems to be to use compiler directives to execute different code blocks depending on whether the logger module is being used or not — but this complicates a bit the whole design and maintenance of the modules.

Another solution would be that no module uses functions directly from another module, and all module functions are instead combined in the tool code. To combine the module functions, callback procedures can be passed as parameters to the functions. Example:

Procedure CustomErrorCallback(ErrorType, ErrorMessage$)
  ; handle or log the error here
EndProcedure

res::ParseComments(File$, Result$(), @CustomErrorCallback())

OR

IncludeFile "mod_Logger.pbi"

res::ParseComments(File$, Result$(), @Logger::AddLog())

But this makes tool programming quite complicated. I think it's better if each module automatically integrates its own dependent modules, so I also think that compiler directives are better.
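A minimal sketch of how such a compiler directive could look inside a module, using PB's Defined() compiler function (the logger ID constant and the Message$ variable are hypothetical):

; route the message to the logger module if the tool included it,
; otherwise fall back to the plain Debug command:
CompilerIf Defined(Logger, #PB_Module)
  Logger::AddLog(#Logger_CodeArchiv, Message$)
CompilerElse
  Debug Message$
CompilerEndIf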

tajmone commented 5 years ago

This is exactly one of the points I've been undecided about. In some places it would seem more natural to let the tool's main code handle "manually" the task of telling each module which callbacks to use, but this would result in more coding on the side of the tool developers, and less abstraction of the modules themselves — possibly causing updates of the modules to force tool developers to update their code too.

So, having the modules communicate with each other behind the scenes for these configuration types of issues, and having the tool developers mainly deal with the Global module when it comes to settings, would allow a better separation between modules and tools code — but the price is a terrible headache for the modules maintainer, because it means that all modules are entangled together.

But definitely, allowing tools to set custom callbacks instead of the default ones is a desirable feature. Still, I'd try to keep module intercommunication hidden from end users.

If only PureBasic supported procedure overloading, some of these tasks would have been much easier, resulting in less verbose and more elegant code. Then we could have used the same procedure names acting differently according to parameter types.

Luckily, PureBasic uses signed integers for pointers (absurd as it might sound), which allows passing negative integers to a procedure to handle default logger types (e.g., -1 = STDOUT, -2 = STDERR, and so on), while positive integers can be assumed to be pointers to a custom callback. So, unless I'm remembering this one wrong, it means that when invoking the logger the same procedure could be used to redirect the text either to one of the standard loggers built into the module, or to a custom callback! This would only require checking if the parameter is null or negative at the beginning of the procedure (and if so, using a Select to evaluate the cases).
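A minimal sketch of that dispatch idea, assuming procedure addresses are always positive (all names here are hypothetical):

; negative values select a built-in output, positive values are
; treated as the address of a custom callback procedure:
#Logger_ToDebug  = -1
#Logger_ToStdOut = -2

Prototype LogCallback(Text$)

Procedure SendLog(Target.i, Text$)
  If Target < 0
    Select Target
      Case #Logger_ToDebug
        Debug Text$
      Case #Logger_ToStdOut
        PrintN(Text$)   ; requires OpenConsole() in the tool
    EndSelect
  ElseIf Target         ; non-null: assume a callback pointer
    Callback.LogCallback = Target
    Callback(Text$)
  EndIf
EndProcedure

; usage:
Procedure MyCallback(Text$)
  ; custom handling here
EndProcedure

SendLog(#Logger_ToStdOut, "redirected to the console")
SendLog(@MyCallback(), "redirected to a custom callback")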

SicroAtGit commented 5 years ago

If only PureBasic supported procedure overloading, some of these tasks would have been much easier, resulting in less verbose and more elegant code. Then we could have used the same procedure names acting differently according to parameter types.

PureBasic functions can be overridden with macros:

Procedure newDate()
  ProcedureReturn 111
EndProcedure

; from here on, Date() expands to newDate() at compile time:
Macro Date()
  newDate()
EndMacro

Debug Date()  ; prints 111 instead of the real date value

But the Debug command is unfortunately not a normal function (not Debug(Text$[, DebugLevel]), but Debug Text$[, DebugLevel]).
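Since Debug is a keyword rather than a function, the best one can do is wrap it under a different name (a trivial sketch; the DebugLog name is made up):

Macro DebugLog(Text)
  Debug Text
EndMacro

DebugLog("now callable with function-like syntax")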

SicroAtGit commented 5 years ago

Luckily, PureBasic uses signed integers for pointers (absurd as it might sound), which allows passing negative integers to a procedure to handle default logger types (e.g., -1 = STDOUT, -2 = STDERR, and so on), while positive integers can be assumed to be pointers to a custom callback.

Good idea. But I don't know if we are on the safe side if we assume that pointer addresses are always positive values. The operating system uses unsigned integers, and PureBasic probably cannot force it to return a positive value for @Procedure() every time.

tajmone commented 5 years ago

I had looked into this a couple of months ago while interfacing with a C library, where I had to handle passing and receiving signed and unsigned integers. During that research I came across a discussion on the forum that brought up the issue of pointers being signed integers in PB. I still have the links and some notes written down somewhere.

The operating system uses unsigned integers, and PureBasic probably cannot force it to return a positive value for @Procedure() every time.

I think it does, and that behind the scenes it casts the OS unsigned integer to a signed Quad before returning the address.

In fact, when I was trying to understand how to handle unsigned integers when interfacing with a C library, I discovered that you don't actually have to do anything at all: PB handles converting Quads to signed or unsigned C ints via the interfacing commands. The only problem is when you have unsigned long integers in structured data, since that forces you to carry out some conversions before passing the structure. When you pass a PB integer to a C library, you don't really have to worry about size differences, as PB will recast the integer according to the required size — and of course, on the library side the passed bits are going to be interpreted as signed or unsigned according to the type used by the library.

Memory addresses are always going to be positive numbers, which means that internally PB must be handling them as unsigned values, even though pointers are of the signed type. We'd only be needing a small range of negative numbers here (probably from -1 to -6), all of which correspond to very high memory ranges in two's complement (0xFFFFFFFF down to 0xFFFFFFFA in x86 mode, 0xFFFFFFFFFFFFFFFF down to 0xFFFFFFFFFFFFFFFA in x86_64 mode). My guess is that it would be safe to assume that these are never going to be procedure addresses, since memory-wise they point to the roof of the allocated memory and would leave very few bytes for a procedure's code — probably not even enough space to handle the call-frame chores of the calling convention.

But I haven't tested it, so it's just an assumption.

For some reason, it looks like PB had to use its only native integer types (32-bit signed int on x86, 64-bit signed int on x86_64) to store pointers. My guess is that this fits into the black-box way that PB handles datatypes internally; if I recall correctly, Fred mentioned that it does handle some behind-the-scenes casting, and that the reason he doesn't want to implement C-like unsigned long ints is that it would introduce complications in expressions, comparisons, etc.

tajmone commented 5 years ago

I've tested the above theory and it works, and I've added a test file to the repo:

https://github.com/tajmone/PBCodeArcProto/blob/master/_tempwork/callbacks/Callbacks_test.pb

Of course, one might as well add an extra parameter to inform the procedure whether it should treat the main parameter as a pointer or as an enumerated value instead, which would probably be more elegant. But in terms of having to pass around fewer variables, this hack could be a time saver.

Also, I think that operating systems don't actually grant applications the use of the virtual memory space in the high ranges where such small negative numbers would overlap:

Under Windows, for example, the upper limit of the address range for 32-bit apps is 0x7FFFFFFF, and for 64-bit apps it is 0x7FF'FFFFFFFF — which is far from any negative range we might ever need in this app:

https://docs.microsoft.com/en-us/windows-hardware/drivers/gettingstarted/virtual-address-spaces