jankae / LibreVNA

100kHz to 6GHz 2 port USB based VNA
GNU General Public License v3.0
1.08k stars 204 forks source link

Make SCPI interface more standard #248

Closed scott-guthridge closed 4 months ago

scott-guthridge commented 5 months ago

Summary

On my bench, I have several devices connected by GPIB (IEEE488) behind a National Instruments GPIB/ENET adapter. I also have a Tek scope that implements a SCPI interface through a TCP/IP socket. I have quite a few existing programs I use for automation. I found, though, that when using an existing program as a template for the LibreVNA, I had to make a lot more changes than usual when going between devices because the LibreVNA's SCPI interface deviates the most from standard SCPI semantics.

The usual flow for an automation program is first to send a lot of commands to set up the instrument. In this example, the Tek scope over TCP. [This example, interestingly, implements a low-frequency VNA using the scope to measure amplitude and phase difference vs. the reference signal.]

    fprintf(fp, "\n*CLS\n");     /* clear any past errors */
    fprintf(fp, "DESE 60\n");   /* CME|EXE|DDE|QYE */
    fprintf(fp, "HEADer OFF\n");
    fprintf(fp, "VERBose OFF\n");
    fprintf(fp, "SOCKETServer:PROTOCol NONe\n");
    fprintf(fp, "ACQUIRE:STATE OFF\n");
    if (a_opt == 1) {
        fprintf(fp, "ACQUIRE:MODE SAMPLE\n");
    } else {
        fprintf(fp, "ACQUIRE:MODE AVERAGE\n");
        fprintf(fp, "ACQUIRE:NUMAVG %d\n", a_opt);
    }
    fprintf(fp, "ACQUIRE:STOPAFTER SEQUENCE\n");
    ... etc. ...

In this case, we sent sent 40 or 41 back-to-back commands (depending on the if/else above).

Next, check if anything so far generated an error:

    fprintf(fp, "*ESR?\n");
    fflush(fp);
    if (fgets(buffer, MAX_LINE, fp) == NULL)
        ... handle socket error ...
    status = atoi(buffer)  // actually, I use strtol and test that a numeric reply was really returned
    if (status != 0)
        ... handle error ....

The standard event status register bits are defined as:

7  PON   device was powered on
6  URQ   user request (can mean someone tried to push a button on the front-panel when locked out)
5  CME   command error
4  EXE   execution error
3  DDE   device-dependent error
2  QYE   query error
1  RQC   device wants to become the controller (of the bus)
0  OPC   operation complete  (works in conjunction with *OPC)

Not all of these bits make sense for the LibreVNA. For example, it's not clear that the GUI knows when the device was powered on, so PON may not make sense to ever set. Maybe there is no need for URQ. On the other hand, the CME bit (syntax error in command) definitely makes sense. Clearly, RQC doesn't apply. OPC can be implemented for completeness, described further below.

If no errors, we trigger an acquisition and wait for the acquisition to complete.

    fprintf(fp, "ACQUIRE:STATE ON\n");
    fprintf(fp, "*WAI\n");

The *WAI blocks command processing until the background acquisition we triggered completes. We need to synchronize with the acquisition because the next command we're going to send is a query to grab the acquired data, and we need to make sure the data is going to be there when we ask for it.

There are several alternatives to the WAI command. A very similar command is OPC?. The only difference between WAI and OPC? is that the former doesn't print any output while the later prints a "1" on a line by itself. Programmer will usually use WAI followed immediately by a :DATA...? or :CURVE...? if all they want is the data back. They'll use the OPC? method if they need to do some action in the program in response to the asynchronous command completing, for example, tell another device to do something. It's also possible to use OPC (which doesn't block command processing -- just arms a mechanism to set the OPC status flag next time all async events have completed) followed by either polling the OPC flag in the event status register, or causing the device to generate a "service request" to generate an interrupt on the host (which could be implemented over TCP/IP by sending an "urgent" message on the socket). In real IEEE 488, there is also serial poll and parallel poll, but those require having a control channel separate from the data channel [NI GPIB-ENET and Visa interfaces implement this by using multiple socket connections]. Of of all these, probably WAI and OPC? are the most useful when working with a single device over a single socket.

Finally, once we know the acquisition is complete, we can ask for the data:

    fprintf(fp, "DATA:....?\n");
    .... read the data back ...

In another issue, a comment mentioned that the OPC? command was interpreted to just wait for the previous command and not wait for asynchronous actions triggered by that command. That's not correct, though. The main purpose of WAI and OPC? are to wait for asynchronous commands, usually acquisition commands, to make data available. If the previous command's effect were synchronous (as opposed to triggering something), then of course it's finished when we get to this command. We could equally have just sent IDN? in this case since it doesn't really wait for anything. That isn't a very useful semantic for WAI/OPC?

There was also a question about what to do if acquisition averaging or continuous acquisition is enabled. Those semantics are well-defined in SCPI. In the averaging case, the operation should be considered complete after the specified number of acquisitions have been taken. In the continuous case, the operation should be considered complete when the next acquisition completes. In all of these, the main completion criterion is: when can the user send a :DATA....? command and be guaranteed there will be new output available.

The LibreVNA SCPI interface would be more useful and easier to adapt to if it didn't print responses to non-query commands, implemented WAI and OPC? to wait for acquisition complete / data available, and provided a basic implementation of the standard event status register and its related commands, at least: DESE, DESE?, ESR?, and if possible, the others: ESE, ESE?, STB?, SRE and SRE? for adherence and completeness. In the next section, I give suggestions on how to do this in a way that doesn't break existing compatibility.

Paths to implement it

Most SCPI scripts start out by sending a CLS command to clear any pending errors. This command isn't currently implemented by the LibreVNA. So a simple approach to avoid breaking existing compatibility would be to implement the CLS command and give it the side-effect of enabling standard SCPI semantics. This would disable output from non-query commands, including itself. A device-dependent command, e.g. :CONFigure:StrictSCPI OFF, could be provided to turn the responses back on if desired.

Next, the WAI command should be implemented and OPC? command modified so that both wait for data to become available. In VNA mode, these commands would block command processing until a :VNA:ACQ:FIN? would return TRUE; in SA mode, it would likewise wait for SA:ACQ:FIN? to become true. Errors shoud also wake it up. Having *OPC? wait for data doesn't really break existing programs since it only waits longer than before, but this behavior could also be conditional under :CONF:StrictSCPI as above, if desired.

A basic implementation of the standard event status register mechanism is pretty easy (actually, all of it is pretty easy). Figure 3-6 on page 894 of this Tek scope document is a better picture of the registers than those given in the SCPI or IEEE 488.2 documents. First, there is a single-byte register, device event status enable register (DESER), read with :DESE? and written with :DESE decimal-number that acts as a mask for which events should be reported. When an error/event is posted, the appropriate bit in sesr is set if it's also in the deser mask (see pseudocode below). The standard event status register (SESR) is read through ESR?, which also clears it back to zero. So at the top of the SCPI script, you'll see something like DESE 60 to enable reporting of errors CME|EXE|DDE|QYE. Then at important points, the program will use ESE? to check if any errors occurred (and clear them). These registers should be global for the device and not associated with the current TCP connection. The remaining error registers are described below.

pseudo code...

class Status
{
private:
    uint8_t deser;
    uint8_t sesr;
    uint8_t eser;
    uint8_t srer;
    bool opc_pending;

public:
    Status()
    : deser(255),  // by default, all enabled
      sesr(0),
      eser(0),
      srer(0).
     opc_pending(false)
    {}

    // *CLS
    void cls_cmd()
    {
        sesr = 0;
        // ... if there is an event queue, clear it out ...
    }

    // DESE?
    void dese_query()
    {
        printf(sock_fd, "%d\n", deser);
    }

    // DESE decimal-number
    void dese_cmd(const char *arg)
    {
        // ... make sure arg is numeric and in 0..255 ...
        deser = atoi(arg);
    }

    void post_event(uint8_t value)
    {
        sesr |= deser & value;
    }

    // *ESR?
    void esr_query()
    {
        fprintf(fp_sock, "%d\n", sesr);
        sesr = 0;
    }

If you want to generate individual event codes/messages and put them into a queue, you can, but it's not required. The standard SCPI method for reading events is documented in chapter 20, status subsystem in the SCPI command reference. That section is a little cumbersome, and I notice that Tek implements a simpler proprietary mechanism: EVQty? to get a count of the number of events in the queue, EVENT? to read and remove the next event as a numeric code, and EVMsg? to read and remove the next event as a numeric code followed by a string explanation, all one one line. These are described in the same Tek document I linked above.

The rest of the error registers are less useful in the context of one device over a TCP connection, but I would implement them anyway for completeness and conformance to the standard -- there's not a lot to them. The next register, "ESER" holds another mask. It is read through ESE? and set through ESE. Next is SBR, described in a moment. And then there is yet a third mask, SRER, read using SRE? and written using SRE. The SBR register is a status byte read through STB?. This query doesn't clear anything. SBR and the masks around it are a lot more useful in IEEE 488 where link commands and data are separate and it's possible to read the status byte through the serial poll mechanism without having to use a query command. Here, however, the only thing that reads SBR is the STB? query, and so it's simpler to just compute the value on the fly as part of the query than to actually store (and maintain) SBR in memory. Bit 5 (ESB) of SBR is true if the status register hasn't been reaped by an ESR? and has bits that are in the ESER mask set. Bit 4 (MAV) is supposed to indicate that some query command has printed output but host hasn't read it all. Over TCP/IP, it's not really possible to determine if this is true because the data could already be queued at the host, and the only way the user can see the answer, anyway, is by reading all the data so far printed (so there isn't any left now), then reading the output printed by the STB? query, so it's pretty easy to argue that MAV reported by this command should always be 0 since you have to clear out the data queue to see the answer. Bits 0-3 and bit 7 are application-specific condition flags -- you can put anything in them you want. All that's left is bit 6, MSS. Bit 6 is true iff the SBR so far computed bitwise anded with the SRER mask is not zero.

    // ... add *ESE, *ESE?, *SRE and *SRE? methods here to set and get ESER and SRER ...

    // *STB?
    void stb_query()
    {
        uint8_t sbr = 0;

        // Set ESB bit to summarize any selected conditions not yet reaped by an *ESR? query.
        if ((sesr & eser) != 0)
            sbr |= 0x20;

        // .... fill in bits 0-4, 7 of sbr from application-specific conditions of your choice, if desired ....

        // Set MSS bit if reason for generating a service request exists
        if (sbr & srer)  // we would add (& ~0x40) to the condition, but we already know it's false in sbr
            sbr |= 0x40;

        fprintf(sock_fp, "%d\n", sbr);
    }

The *STB? query is useful mainly if application-specific condition bits (0-3, 7) are used. Otherwise, all it does is provide a way to check for errors without clearing them, and checks off a required command for SCPI. Code for it is pretty minimal.

Next SCPI required command is OPC (not the query). Unlike WAI and *OPC?, this command does not wait for commands to complete. Rather, all it does is change state flags. If there are no asynchronous commands running, then it posts an OPC event (setting bit 0 of sesr if enabled by the deser mask). If one of more asynchronous commands are running, then it sets a flag to cause an OPC event to be generated next time the number of active async commands goes to zero.

    void opc_cmd()
    {
        if (number of async commands running is zero)
            post_event(0x01);
        else
            opc_pending = true;
    }

    void report_all_async_commands_finished()
    {
        if (opc_pending) {
            post_event(0x01);
            opc_pending = false;
        }
    }

The way OPC is used is the following. A program starts one or more asynchronous commands such as triggering an acquisition. Then it sends OPC. The program can continue to send commands and queries to the device. These are not blocked, though it's recommended not to start further async commands until the earlier ones are finished. Otherwise, you can get a misleading OPC status bit while newly started async commands may still be active. Now, it's possible to poll for all background commands finished by simply doing an *ESR? query and checking bit 0.

Extra information

I can help with implementation if needed.

jankae commented 5 months ago

Thank you very much for the detailed explanation. I didn't implement all these details because I didn't see much of a point - usually I just use the device-dependent commands to control my instruments. I guess the situation might be different if you have a mixture of GPIB and TCP SCPI instruments (I haven't used GPIB at all so far).

To be honest I don't see the point of some of the mandated commands (e.g. SRE). But adhering to a standard is always a good idea. Most things (like the status registers) should be pretty easy, only the WAI/OPC feature will get a little bit more complicated. Could you link me to some official SCPI syntax documentation? I could only find the SCPI Syntax & Style guide from 1999 (https://www.ivifoundation.org/downloads/SCPI/scpi-99.pdf) but for example this does not contain the DESE command you mentioned.

jankae commented 5 months ago

Partial implementation over on the SCPI_improvement branch:

Still missing/unclear:

Since this is on a different branch, there are no automatic builds. Since you offered help with the implementation I assumed you would be able to compile yourself. Let me know what you think :)

scott-guthridge commented 5 months ago

I just realized that the first level mask, device event status enable register (DESER), and its two commands, DESE and DESE?, are Tektronix extensions. They don't exist in the IEEE 488.2 standard. I apologize for that mistake. They don't need to be implemented.

Docs: the IVI foundation SCPI doc from 1999 you linked above is the current version as far as I know. It builds upon the IEEE 488.2 standard (Part 2: Codes, Formats, Protocols and Common Commands), which builds upon IEEE 488.1 (IEEE Standard Digital Interface for Programmable Instrumentation). I have a paper copy of 488.1 which I purchased from IEEE (for like $80 as I remember many years ago) -- it took me a while to be able to read it -- it's not an ideally written document. I have 488.2 in PDF form because I'm lucky enough to work for a company that provides access to all IEEE docs. I can't share the PDF, though. We have to agree not to. Plus, they mark every page with identifying information. Anyone can buy the 264 page PDF from IEEE for $51 ($41 for members). I don't personally like that IEEE keeps standards (and journal articles) behind pricey pay walls -- I find it counter to the advancement of good engineering. Best I can offer, though, is to answer questions and give suggestions on correct implementation. I'll be more careful not to let vendor extensions leak in to my suggestions.

scott-guthridge commented 5 months ago

Build worked fine. I started testing, will do much more. So far, it looks pretty good. I saw a couple odd things -- will investigate further. At one point, I got it to lock up where it wouldn't take more commands from the socket -- disconnecting and reconnecting got out of it. Question: can calls to setOperationPending come from a different thread from the one processing input from the socket? Is a lock needed around situations like the isOperationgPending and processing of the OCAS flag here?

        if(isOperationPending()) {
            OCAS = true;
        } else {
            // operation already complete
            setFlag(Flag::OPC);
        }

to make sure that completion happening right after the test can't be lost?

jankae commented 4 months ago

I think we should be fine here without a lock. All handling of SCPI commands and all calls to setOperationPending are called from the main QT event loop.

scott-guthridge commented 4 months ago

The problem I run into right away is that if I send multiple commands to the socket before asking for a response, it processes the first command and then seems to hang. Short repeat-by in python:

import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.connect(('localhost', 12345))
    in_file = sock.makefile('r')
    out_file = sock.makefile('w')

    # This command works.
    print('*IDN?', file=out_file)
    out_file.flush()
    line = in_file.readline().strip()
    print(f'*IDN? response: {line}')

    # This sequence never generates a response.
    print('*CLS', file=out_file)
    print('*ESR?', file=out_file)
    out_file.flush()
    line = in_file.readline().strip()
    print(f'*ESR? response: {line}')

Here, I'm using file objects to access the socket, especially on the read side, because they automatically handle TCP sending multiple lines and partial lines. They also handle the conversion between strings and bytes, though that's more specific to python than to C or C++.

I wrote a similar test in C. In the C version, even if I flush each write individually, it still hangs because the writes are quick enough in succession that the kernel's Nagle algorithm successfully merges the writes. If I put a sleep between the writes, then it works fine. Snippet:

    fprintf(fp_out, "*CLS\n");
    fflush(fp_out);
    sleep(1);
    fprintf(fp_out, "*ESR?\n");
    fflush(fp_out);

    errno = 0;
    if (fgets(buf, sizeof(buf), fp_in) == NULL) {
    (void)fprintf(stderr, "%s: fgets: %s\n", progname, strerror(errno));
    exit(5);
    }
    buf[sizeof(buf) - 1] = '\000';
    printf("Read [%s]\n", buf);

Though I have worked with event loop environments before, I haven't specifically worked with Qt. I'm wondering: could the if statement in the TCPServer constructor here be a while? I'm thinking that if multiple lines are received, nothing signals this block again to process the remaining input until yet more input is later received.

        connect(socket, &QTcpSocket::readyRead, [=](){
            if(socket->canReadLine()) {
                auto available = socket->bytesAvailable();
                char data[available+1];
                socket->readLine(data, sizeof(data));
                auto line = QString(data);
                emit received(line.trimmed());
            }
        });
jankae commented 4 months ago

Thank you for the excellent analysis, you are completely right. I was able to reproduce the problem with your python script and changing the if to while is indeed the fix.

scott-guthridge commented 4 months ago

Fix works, thank you. Overall, it's really much improved. Already, it addresses pretty much all of the things that I was asking for. But since I'm testing, I'll report a few things I noticed.

First, the OPC? query (as opposed to command) should not cause the OPC bit in the standard event status register to be set. OPC status bit is set only after an OPC command is sent and either no async commands are running or the last one (if there can be more than one) finishes.

Second, I get a query error if I give invalid arguments to a query command, e.g. VNA:TRACE:DATA? XXX. This should be a command error, though. The IEEE 488.2 standard says query error should be used only when the controller tries to read data from the device when none was available. To put this into context, in 488, the controller has to drive bus cycles to read bytes from the device one by one. Strict interpretation would imply that it doesn't apply in the case of a TCP socket where the device can write to the socket without the host driving it. That said, my Tek scope implements it differently (over TCP socket). It gives me query error if I send a command to read back the current trace (equivalent of VNA:TRACE:DATA? Sxx) when there is no trace available. I'll leave it up to you if you want to follow 488.2 strictly or repurpose the error bit as Tek does. But either way, it probably shouldn't be used for malformed query commands.

Third and forth, exponential notation isn't accepted in numbers, and some commands still generate CMD_ERROR responses, e.g.:

VNA:FREQ:START 100E6;STOP 200E6
CMD_ERROR
CMD_ERROR

488.2 gives a sort of flow chart to describe the number format -- pretty standard format. Here's my translation of their chart to regex:

[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]\d+)?

Question: would it be helpful for me to write test cases (since that's what I'm basically already doing) that can be used in a regression test? I'm currently writing in python.

jankae commented 4 months ago

Thanks again for the feedback, should be all corrected now. I am mostly setting command errors now but to be honest I am not paying very close attention to the correct error bit - any error bit set means that a command or query was not executed correctly.

Exponential notation is now also available and CMD_ERROR should no longer be returned (only the error bit set).

would it be helpful for me to write test cases

Yes, please! Python would also be my preferred language for that. If you do have a test script, please share that with me.

scott-guthridge commented 4 months ago

After an RST command, the device goes into VNA continuous sweep mode. But section 10 "RST Conditions" of the SCPI standard says that devices should not source power from any port after an RST. This actually makes sense. Suppose, we're doing several automated tests on some delicate device. Each test would like to start out with a RST to clean up any settings from the previous test, then set whatever is needed for this test, e.g. in this case, VNA:STIMulus:LVL -30, then run the test. Shutting off outputs protects the device from possibly strong signals while the setup is being done. So, I think *RST should reset all settings as it's doing except that it should also turn off sweeps.

Status: I'm working on minimal modifications to the existing tests to make them work again. Then I'll build on that and add more tests in a few small commits. Should I open PRs for each incremental change?

I'm currently chasing a problem with output of the next query sometimes getting eaten after a :DEV:CONN. I have a lot going on, but am making steady progress.

jankae commented 4 months ago

After an RST command, the device goes into VNA continuous sweep mode. But section 10 "RST Conditions" of the SCPI standard says that devices should not source power from any port after an *RST.

Hm, I am a bit conflicted here. I admit that I did not check the standard before implementing RST. Your statement is correct. However, every single of my instrument which I have checked goes into a "default" state when I issue RST (the same state it is in after it is powered on). That includes a VNA (Siglent SNA5000A) which does perform sweeps (with active stimulus signal at port 1) after sending *RST.

Option 1: change RST to behave according to the standard

Option 2: leave it as it is

If a user really needs a RST state which creates now stimulus, there still is the possibility: The startup behavior (and thus also the RST behavior) of the GUI is defined in the preferences. It is possibly to select a setup file which should be used instead of the default configuration. This setup file could e.g. already set up a sweep with -30 dBm or just select the generator mode with all ports turned off. This option is also why I initially didn't implement the RST command at all. Depending on the preferences, the startup behavior can be different. I don't want RST to return to a fixed state (because a user may prefer different startup conditions) but was also worried that RST behavior may be confusing if it depends on the preferences.

jankae commented 4 months ago

Should I open PRs for each incremental change?

Sure, if you have something that is worth merging, I'll gladly accept PRs. If you are working on several unrelated changes, individual PRs are preferred.

scott-guthridge commented 4 months ago

I would vote for the *RST command going to a fixed state, ignoring user preferences. If it loads preferences, then automation scripts depend on external settings rather than being self-contained. A seemingly innocuous change to a setting may change the behavior of existing programs.

SCPI also makes this statement: Finally, thought should be given to implementing *RST conditions so that the incremental programming necessary after *RST for the applications be minimized. They're suggesting that scripts shouldn't have to go through the gamut of setting every possible parameter to get into a known state.

One last argument I'll give is from the perspective of software compatibility. Suppose an automation script does set every possible parameter to get to a known state, but later, a new parameter is added to the device that didn't exist at the time the script was written. And suppose that the previous test run used the new parameter and didn't set it back to the default state. Now the behavior of the existing script may change unexpectedly because it didn't know to set that new parameter back to the default. Having *RST reset all settings avoids this type of inter-dependency.

All that said, I'm not suggesting changing the power-on behavior, but only the behavior of the newly added *RST command.

jankae commented 4 months ago

Thanks for all the additional integration tests. And I agree, it is probably best to separate *RST from user preferences. I will come up with a solution for that

scott-guthridge commented 4 months ago

Random thought: you might keep the SRE, SRE? and STB? commands unimplemented for now to leave open the possibility of using them later to implement a form of 'service request'. In IEEE 488, any device can pull down on a control line to generate an interrupt on the host (solicited or not), then there are a couple different protocols the host uses to figure out which device it was and what it wants, using an out-of-band protocol to retrieve the status byte (SBR, not SESR) from the device -- it's done in a way that doesn't use or affect the normal command/response stream. You can implement something over TCP that while isn't exactly the 488 service request, has the same spirit. When service requests are enabled on the device (through SRE), the device could signal to the host by sending an urgent/out-of-band message on the socket (sending the status byte (SBR) would be appropriate). Device might send it again any time a bit has an inactive to active edge under the mask (status_byte & statsRequestEnableRegister). Host software can select on the socket or have the OS send a SIGURG signal to notify it of the OOB, and read the OOB message to get the status byte. Then it can interrogate the device (over the normal command/response stream) as needed for more details. Not only can this used to indicate completion of an async command, but it can also be used for reporting unsolicited conditions, for example, temperature/voltage over limit. Bits 7 and 3-0 of the status byte can be used to indicate these kinds of conditions. SCPI makes some recommendations for assignment of the bits: 7 for "operational status", 3 for "questionable status', 2 for "error event queued" and 1 and 0 free to the designer. It gives some examples of events that would fall into those categories. Anyway, an idea.

jankae commented 4 months ago

I have changed the behavior for *RST to use fixed values now (regardless of the user preferences). Apart from the unimplemented SRE/STB, I believe with that all the requirements for making SCPI more standard are now fulfilled

scott-guthridge commented 4 months ago

It looks great. Thanks much for doing this.