nasa / trick

Trick Simulation Environment. Trick provides a common set of simulation capabilities and utilities to build simulations automatically.
Other
32 stars 16 forks source link

Job Time Log Needs Unit To Match Data #1609

Closed keithvetter closed 7 months ago

keithvetter commented 10 months ago

The log file log_frame_trickjobs.trk reports how long each Trick job takes. The log file says the job times are in seconds but in reality the data is in microseconds. To fix this, either the logged units should change or scale the value to seconds. Bottom line, if the log file says the units are in "s" the data should be in "s".

Alex wrote, "The DP_rt*.xml files have a "1e-06" scale factor that converts microseconds to seconds. However this isn't perfect, a user can change the tic value and this scale factor will scale time incorrectly. It may be time to make a change in the sim to convert the times with the correct tic value and remove the scale value from the DP files."

Keith wrote back, "I see. Using a scale factor separate from the log unit and data is not working if the DP file is not used. In our case, Long is using the koviz RT report, which processes the log files directly. Long also had to change some dt from 1.0e-6 to 1.0e-9 as well since his sim freq is 128Hz == 0.0078125. The 5 in 0.0078125 is 1.0e-7. I think maybe this brings up the tic value/scale factor issue. I think it best to keep the log file unit and data matching. It will always be consistent that way."

I hope this isn't too much, but the log data is currently something like: sys.exec.out.time {s}, job_time {s} 0.0, 15000 0.1, 20000

When it should be: sys.exec.out.time {s}, job_time {s} 0.0, 0.000015 0.1, 0.000020

I hear the complication is that the "tic" can change, but really the log file should just be independent of any outside scaling. Just think if every variable in log files were like this, you'd see a variable with unit "m", but then find out there is this outside file, like S_job_execution, or DP_rt that contains some scaling so your "m" that is logged is not "m" but really "km" or something.

PS: The frame_overrun_time and frame_sched_time also have units of {​​​​​​​​​​​--}​​​​​​​​​​​ but need to be in microseconds as well.

hchen99 commented 10 months ago

Thanks for the detailed info. We'll look into this.

keithvetter commented 10 months ago

Thank you!

hchen99 commented 10 months ago

So we're talking about the logged data when frame log is on and these logged data files are:

Please let me know if anything missing or incorrect.

Also, as Alex pointed out, the applicable DP_ xxx files need updating with scale being removed if time is updated with correct time tic value.

keithvetter commented 10 months ago

Yes, that's correct, the files produced when frame log is on is the list of files to update. The list of files you provided looks correct to me.

You have "Update if applicable" for log_frame_trickjobs.trk and log_frame_userjobs_main.trk. These two files need updating. I just tested with the SIM_ball_L2 RUN_realtime run and see the JOB times in log_frame_userjobs_main.trk do not match the units which say {s}.

And yes, instead of {μs}, I think {s} is better as logging in the SI base unit seems most natural.

Yes, and the scale would be removed from the DP_rt files.

Thanks so much.

keithvetter commented 10 months ago

Here is a plot of what I mean. The y-axis says the JOB ball.ball_floor() time is in {s} but note the spike at 4.96 is 12. The 12 is 12 {μs} not 12 {s}. I think the values should change, so the spike should really be 0.000012 as that's the real value in {s}.

trick-issue-1609

hchen99 commented 10 months ago

Thanks for the additional info, Keith! Agreed that all {s} values in both trickjobs and userjobs except sys.exec.out.time need to be updated.

keithvetter commented 10 months ago

Your welcome, Hong! I think we're on the same page, very good.

hchen99 commented 8 months ago

@keithvetter If you have a chance, please take a look at the frame log data produced by PR#1623 and let us know if you see anything else. Please be aware that log_timeline_init was in seconds before the update, the reason for so large number at beginning for log_timeline_init was those large numbers were epoch time and once realtime started, number became reasonable numbers. log_timeline logs data once realtime starts so it is correct. log_timeline_init top rows up to job 13.02 are not that small because realtime is not started yet. log_timeline and log_timeline_init logs start and and stop for each job within the frame. Other time in other frame log files should be all in seconds.

keithvetter commented 8 months ago

If I clone the latest Trick, is PR#1623 in master, or do I need to switch to a development branch?

I am working other items at present, but should be able to review this issue fairly soon, sometime in Dec. 2023. Thank you for tackling this! I'm very happy to review and test - and looking forward to it.

hchen99 commented 8 months ago

If I clone the latest Trick, is PR#1623 in master, or do I need to switch to a development branch?

I am working other items at present, but should be able to review this issue fairly soon, sometime in Dec. 2023. Thank you for tackling this! I'm very happy to review and test - and looking forward to it.

Thanks so much! Sorry, please use branch 1609-framelog-job-time that contains the latest master commit.

keithvetter commented 8 months ago

Okay, looking forward to branch 1609-framelog-job-time :-)

keithvetter commented 8 months ago

@hchen99 I tested this using the ball L2 realtime sim.

This looks good:

  1. frame_overrun_time and frame_sched_time are now in {s} rather than {--} and data looks good
  2. The logged JOB times now match the units of {s} (the old JOB logged microsecs and had {s} as unit!)

However, the following files which are auto-genned when making a sim need to have the scale factors removed. If you plot using these DPs, the scale factors multiply by 1.0e-6 and give unintuitive plots.

  1. DP_Product/DP_rt_trickjobs.xml
  2. DP_Product/DP_rt_userjobs.xml
  3. DP_Product/DP_rt_frame.xml

And arguably, I think instead of renaming the variables below, it might make sense to keep them without the "_seconds" added. The DP files mentioned above also reference the old names without the "_seconds" appendage. In order to work with older sims, I think it might be best to rename them back to what they were originally.

Instead of: trick_real_time.rt_sync.frame_overrun_time_seconds trick_real_time.rt_sync.frame_sched_time_seconds

Just keep it: trick_real_time.rt_sync.frame_overrun_time trick_real_time.rt_sync.frame_sched_time

PS: I really didn't look at the timeline business. It doesn't look like it has changed, and most likely okay.

hchen99 commented 8 months ago

@keithvetter Thanks so much for testing this out. As for following DP XML files, I double checked that the "scale" attributes that were removed after the fix. Wondering if you still see "scale" in the DP files?

sharmeye commented 8 months ago

@keithvetter I don't think we should change the variable names back to what they were originally. Time is actually measured in tics, not seconds, so I'm not convinced that we should change the fundamental unit for those overrun and scheduled times to seconds when that's not how they are tracked or measured.

keithvetter commented 8 months ago

@hchen99 I tried again. This time I removed my DP_Product directory. I now see that the scale factor is gone from DP_rt_trickjobs.xml, DP_rt_userjobs.xml and DP_rt_frame.xml. Sorry about that! I think we're good!

@sharmeye Okay, I see what you are saying. I'll argue - if you go back to Trick of yore, original original, when frame overrun was in seconds, it was "frame_overrun_time", and it wasn't renamed "frame_overrun_time_clicks" when Trick was redone to clicks. But yeah, it will be confusing in any case comparing "frame_overrun_time" and "frame_overrun_time_seconds", so really it's okay. I still prefer it to be "frame_overrun_time" and make the unit seconds... but like I say, it's arguable and adding the "_seconds" makes it explicit and it's different from the underlying "real" measurement.

Okay, so yes, looks great (and it is 3:14 pi'o'clock so it's a good omen).

keithvetter commented 8 months ago

@sharmeye I thought of an additional reason for "frame_overrun_time". If one mouse rolls over the units on "frame_overrun_time" in koviz, the units change, so you can see "frame_overrun_time" in microseconds. If you use "frame_overrun_time_seconds", it won't make sense under unit conversion. In general, if the name of a variable contains the unit, the name will be confusing under unit conversion e.g. it's not ball.state.position_meters. See below: 1609-frame-name

sharmeye commented 8 months ago

That's a good point, but that only makes me want to change the name from time_seconds to something like time_calced or something like that. It doesn't actually make me want to change the rt_sync time variable underneath. But you are right that unit conversion would make that confusing, so we'll have to address that.

keithvetter commented 8 months ago

The standard unit of time is the second, so it makes sense to me how it was originally before Trick changed to clicks. The number of clicks isn't time. The unit for the number of clicks should be {1}, I think. Time is the number of clicks multiplied by click duration, and I wouldn't call nClicks*clickDuration, "time calculated", I'd call it "time". But yes, "time_calced {s}" is better than "time_secs {s}".

sharmeye commented 8 months ago

I agree for the most part, but at the same time i don't want to disrupt the inner workings of trick by swapping which variables mean what underneath. We'll think on it some more and find something that makes us both, if not happy, equally unhappy

keithvetter commented 8 months ago

I hope I'm making it clear that Trick changed it's notion of time to clicks and this issue is really the repercussion of that. The meaning of the variables changed when Trick went from time to clicks, and now I'm saying the original variable names made more sense to begin with. frame_time is frame time. Anyhow, I understand that now the other has been adopted, and meanings have already changed, it would be difficult to go back.

I also have to change koviz real-time reporting. One benefit of calling "frame time" something completely different is that I'll key off of that and bail if the rt recording doesn't have the new variable name. The koviz real-time report is the reason this issue was raised. Long's sim went from a microsecond tic to a nanosecond tic - and that is why frame time needed to be an independent measurement in time.

Okay, I'm glad we got to go back and forth, and I'll throw out a new shortened variable name:

trick_real_time.rt_sync.frame_time

hchen99 commented 8 months ago

Totally on board with not having "seconds" in the names. In summary (please feel free to let me know if I misunderstood the request here):

keithvetter commented 8 months ago

I like: trick_real_time.rt_sync.frame_sched_time_seconds ==> trick_real_time.rt_sync.frame_time

I use frame_overrun_time in koviz' real-time report for counting the number of frame overruns. I count the number of points over zero.

Honestly, I think it's been forever confusing on the DP_rt_frame.xml plot. See this plot (which happens to go with the season): trick-rt-plot

The bottom plot in red is the frame time and makes sense that it hovers around 0.01 (the frame). The green line is negative which means it is not overrunning, plus is over... is it not confusing?

Check this out, what if I simply flip overrun time about zero and call it frame margin:

trick-frame-margin

If that was done, I'd change koviz to count overruns by asking when the frame_margin was less than zero. The plot makes more sense to me that way - positive is good and means you've got room at the end of the frame.

I'll think more, but this may be good if you can flip overrun and call it margin.

  1. frame_sched_time -> frame_time
  2. frame_overrun_time -> frame_margin
sharmeye commented 8 months ago

We were just talking about that "frame margin" plot earlier today. My only hesitation is that sometimes the info we get out of that plot appears to be contradictory, even taking into account inversing it and calling it margin. After the holidays I intend to throw some time (bugs permitting of course) to investigate whether that plot is actually telling us what we think it is. In the meantime, I'd like to leave it alone until i'm satisfied it's not an out-and-out fabrication

Edit to say: I don't want to flip the graph and change what's displayed from these charts until we assess what's happening. I have no problem with coming up with another variable for the frame overrun time in units of time.

keithvetter commented 8 months ago

I only use the "overrun time" for detecting > 0 for overrun count, so for the realtime report, we could just keep frame_overrun_time as clicks and forget about a time measurement, but in that case I'd remove the plot from DP_rt_frame.xml, or yeah, rename it and make it a measurement in time :)

I do like the idea of frame margin rather than negative overrun time, but of course, only if it is the frame margin i.e. the positive time between finishing the work of the current frame while waiting to fire off the next frame or the negative time missed if stalled on the current frame while trying to fire off the next.

Okay, next year. Sounds good.

sharmeye commented 8 months ago

Yeah I agree, a frame margin plot would be much more intuitive and useful (and would dovetail in nicely with some of the other performance tools i want to implement next year).

alexlin0 commented 8 months ago

Hey there, the ghost of Trick past chiming in. If you make a "margin" graph where frame margin is positive, then are overruns negative? When you look at a graph would you be looking for downward negative spikes to find the overruns?

I like the frame graph as it is. Overruns as positive numbers and underruns as negative. I think of it analogous to golf scoring, zero is par and negative numbers good, positive numbers bad. I rarely use the frame schedule time graph exect to confirm that one of the scheduled jobs caused an overrun.

Please leave this graph as is, you can create new graphs with new variables, but leave the old graph for the old guy. 😃

keithvetter commented 8 months ago

I think:

frame_margin {s} = -1.0frame_overrun_time{1}click_duration{s}

If DP_rt_frame.xml were to plot frame_margin scaled by -1.0:

  1. The plot labels can stay the same (keep underrun/overrun)
  2. The units will be in {s}
  3. Alex will have his same plot
  4. We'll have a new variable name that is more intuitive

If scaling the plot by -1.0 just seems ugh, then we're left with some synonym of frame_overrun_time to be frame overrun time. I'll go with the shortened "frame_overrun":

frame_overrun {s} = frame_overrun_time{1}*frame_duration{s}

PS: Percolation over the holidays may help too :)

sharmeye commented 8 months ago

Maybe we leave the plot as it is for Alex and add a third that's telling the same story in a different way like Keith is suggesting. Last I checked, plots are free and we can make as many as we like. Like my daughter says when you present her with a choice: "I'm not really an or person, I'm more of an and person."

keithvetter commented 8 months ago

There is a problem with leaving the plot as it is. Please bear with me as this is the crux of the situation. If you leave it the same, by same I mean you keep plotting frame_overrun_time {1}, you will have a DP file with a scale factor that can vary from sim to sim. The whole point of this issue is to log the frame underrun time in seconds so that the underlying logged data is in seconds forever from sim to sim. There is no way to look at the logged data right now without opening up the DP_rt_frame.xml file to tell what the actual underrun time is.

I think the frame_margin option is a "both and" solution, because it gives us an intuitive variable name and gives the plot Alex wants and we could also plot the margin in a different plot.

But if introducing frame_margin is more convoluted, the most simple solution is just to make a new frame overrun variable in {s} called "frame_overrun" and just plot that.

I just want to make sure I'm making the point that the logged data is the crux of the issue, not the plot.

keithvetter commented 8 months ago

The current plot (seen below) has frame underrun's units in {--} because the frame_underrun_time is logged unitless since it is clicks, not time: frame-underrun-unitless

PS: I made a mistake earlier in a comment, I corrected it, but meant:

frame_margin {s} = -1.0frame_overrun_time{1}click_duration{s}

alexlin0 commented 8 months ago

To the original issue, I don't have any issues with multiplying out frame_overrun_time*time_tic_value in the sim and logging it. I didn't do it before because it creates a variable that isn't used for anything except logging.

keithvetter commented 8 months ago

@alexlin0, are you suggesting 1 or 2?:

  1. Logging a new variable name like, frame_overrun, where: frame_overrun {s} = frame_overrun_time{1}*time_tic_value{s}

  2. Changing frame_overrun_time's units from {--} to {s} and making frame_overrun_time{s} = frame_clicks{1}*time_tic_value{s}

hchen99 commented 8 months ago

This is what I am getting and hope it's ok with everyone:

[!WARNING]

keithvetter commented 8 months ago

@hchen99 I think that correctly sums up the most likely direction, however, if Alex takes 2 from my last comment, it would imply changing the units of frame_overrun_time to {s} and not introducing a new variable for overrun time. That would contradict what Sean suggested, new name, but maybe it's not that bad underneath the hood.

We're almost to a consensus.

keithvetter commented 8 months ago

I am sorry, after re-reading the response from @alexlin0, I am fairly sure he meant he is okay with adding a new variable to log, so @hchen99, I think how you summarized puts us all in agreement.

hchen99 commented 8 months ago

Oh, ok. No problem @keithvetter To us that keeping the existing tic time variables as is so for sure that everywhere else would not be impacted. Glad to know that Alex is ok with adding new variables. Will do more tests and make sure needed frame log data and DP files are properly logged or generated.

hchen99 commented 8 months ago

When I am actually changing the names, thinking how about:

hchen99 commented 8 months ago

Ok, made changes as following but will see how people react to change from frame_time to frame_sched after new year.

keithvetter commented 7 months ago

Hi @hchen99. I like the simplified names better than adding extra wording to make it more specific as that only seems to confuse. This is still sort of a work around as frame_sched_time is not really time but tics, but we've already discussed that - and are keeping those variable names the way they are for historical reasons.

hchen99 commented 7 months ago

@keithvetter The latest commit of this PR has trick_real_time.rt_sync.frame_time. People seemed to be fine with it. So I don't plan on changing it to trick_real_time.rt_sync.frame_sched. If you prefer frame_sched, I can make a quick update before closing.

keithvetter commented 7 months ago

@hchen99 Thank you, I like frame_time and didn't mean to say frame_sched in my last post.

hchen99 commented 7 months ago

@keithvetter Cool, we'll close the associated PR at our next tag-up.