robert-burger / libethercat

EtherCAT master library. This library is used to build a deterministic fieldbus network with EtherCAT components.
https://www.dlr.de/rm

Rust wrapper causes unexpected idx received in HW_RX #5

Closed · marcfir closed this issue 4 months ago

marcfir commented 4 months ago

Hi Robert, I am working on a Rust wrapper for libethercat, using `example_with_dc.c` as a reference. The DC example works in C with my demo setup, but going to SAFEOP in the Rust wrapper causes "unexpected idx received" errors in HW_RX, like `HW_RX : received idx 133, but we did not send one?`.

Simply calling the wrapper functions in Rust according to `example_with_dc.c` should result in the same behavior. Adding some `printf`s to send and recv showed that multiple packets are being sent with the same idx. Is this a timing issue?

If I understand correctly, TX and RX are handled in different threads. TX in the DC example is handled by the `cyclic_task`. `ec_send_distributed_clocks_sync` and `ec_send_process_data` queue datagrams in the `tx_high` queue, which are then sent with `hw_tx`.

The `ecat.rx` thread is spawned during `ec_open` and checks the idx in the datagram in `hw_process_rx_frame`. `tx_send[p_entry_dg->idx]` is set before the actual sending.

robert-burger commented 4 months ago

Hi Marc

> Going to SAFEOP in the Rust wrapper causes "unexpected idx received" errors in HW_RX, like `HW_RX : received idx 133, but we did not send one?`.

IMHO that should never happen. I usually get this if more than one EtherCAT master instance is accidentally sending packets over the same interface.

> Simply calling the wrapper functions in Rust according to `example_with_dc.c` should result in the same behavior. Adding some `printf`s to send and recv showed that multiple packets are being sent with the same idx. Is this a timing issue?

If it is the only process, it might be timing. But I don't see how, because an idx is only reused after its frame has been received back after the round trip, or after a timeout (which I guess should be very high; usually the round trip of a packet takes less than 100 µs). Do you get any timeout messages? What are the priorities/scheduling policies of the TX and RX threads?

> If I understand correctly, TX and RX are handled in different threads. TX in the DC example is handled by the `cyclic_task`. `ec_send_distributed_clocks_sync` and `ec_send_process_data` queue datagrams in the `tx_high` queue, which are then sent with `hw_tx`.

Yes, that's the case if you are using the SOCK_RAW interface. If you are using one of the hacked drivers (`linux` subfolder), they can be configured in EtherCAT polling mode. In this mode the NIC does not generate interrupts, and after TX it waits until the frame has returned (so there is no RX thread).
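
For illustration, here is a toy sketch of that polling-mode flow: the same thread sends and then polls until the frame returns. The names `nic_tx`/`nic_try_rx` are hypothetical stand-ins, not libethercat APIs.

```c
/* Toy sketch of polling mode: no RX thread, the sender polls the NIC
 * until the frame has completed its round trip. Hypothetical names. */
#include <stdio.h>
#include <stdbool.h>

static int polls_left = 3;          /* pretend the frame returns on the 3rd poll */

static void nic_tx(void)     { puts("TX: frame sent"); }
static bool nic_try_rx(void) { return --polls_left == 0; }

int main(void) {
    nic_tx();
    /* no RX thread: poll until the frame has returned */
    while (!nic_try_rx())
        puts("RX: not yet, polling again");
    puts("RX: frame returned, cycle done");
    return 0;
}
```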

> The `ecat.rx` thread is spawned during `ec_open` and checks the idx in the datagram in `hw_process_rx_frame`. `tx_send[p_entry_dg->idx]` is set before the actual sending.

You mean the RX thread receives a datagram with an idx before the TX thread has marked that idx in the `tx_send` array? This should not be possible: the frame is filled with datagrams here https://github.com/robert-burger/libethercat/blob/d9c124be9059a71cc81b2941cf2ec3a9b1736b1a/src/hw.c#L297 and marked in `tx_send`, and this happens before the send done here https://github.com/robert-burger/libethercat/blob/d9c124be9059a71cc81b2941cf2ec3a9b1736b1a/src/hw.c#L283
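
For illustration, a toy model of that ordering with simplified types (not the actual `hw.c` structures): the idx is marked in `tx_send` strictly before the frame goes out, so the RX side can never legitimately see an unmarked idx.

```c
/* Toy model of the mark-before-send ordering described above.
 * Simplified bookkeeping, not the real libethercat structures. */
#include <stdio.h>
#include <stdbool.h>
#include <string.h>

#define IDX_MAX 256

static bool tx_send[IDX_MAX];       /* "we sent this idx" bookkeeping */

static void fill_frame_and_mark(unsigned idx) {
    /* ...datagram is appended to the frame buffer here... */
    tx_send[idx] = true;            /* marked BEFORE the send below */
}

static void socket_send(unsigned idx) {
    printf("TX: sent frame with idx %u\n", idx);
}

static void rx_frame(unsigned idx) {
    if (!tx_send[idx])
        printf("HW_RX: received idx %u, but we did not send one?\n", idx);
    else
        tx_send[idx] = false;       /* round trip complete, idx may be reused */
}

int main(void) {
    memset(tx_send, 0, sizeof(tx_send));
    fill_frame_and_mark(133);
    socket_send(133);
    rx_frame(133);                  /* ok */
    rx_frame(42);                   /* triggers the warning from the issue */
    return 0;
}
```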

Just to mention, I have no experience programming in Rust, but is the wrapper code accessible anywhere?

marcfir commented 4 months ago

Hi Robert,

> Do you get any timeout messages? What are the priorities/scheduling policies of the TX and RX threads?

I get timeouts. Here is the log (log.txt)

`ps` gives the following, so 99 and 100. This matches running `example_with_dc` on my system.

```
 11796 main      TS  19 libethercat-rs/main
 11796 ecat.rx   FF  99 libethercat-rs/main
 11796 ecat.asy  TS  19 libethercat-rs/main
 11796 ecat.mbx  TS  19 libethercat-rs/main
 11796 ecat.mbx  TS  19 libethercat-rs/main
 11796 ecat.mbx  TS  19 libethercat-rs/main
 11796 ?M7       FF 100 libethercat-rs/main
```
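
For reference, a minimal sketch of how a thread ends up at SCHED_FIFO priority 99 like `ecat.rx` above, using the generic pthread API (not libethercat's own setup code; needs root or CAP_SYS_NICE to succeed):

```c
/* Minimal sketch: a thread raising itself to SCHED_FIFO priority 99.
 * Compile with -pthread; requires root or CAP_SYS_NICE. */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

static void *rx_thread(void *arg) {
    (void)arg;
    struct sched_param param = { .sched_priority = 99 };
    int rc = pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);
    if (rc != 0)
        fprintf(stderr, "pthread_setschedparam: %s\n", strerror(rc));
    puts("rx thread running");
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, rx_thread, NULL);
    pthread_join(t, NULL);
    return 0;
}
```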

Ah, I see. And getting a new idx https://github.com/robert-burger/libethercat/blob/d9c124be9059a71cc81b2941cf2ec3a9b1736b1a/src/idx.c#L56 is protected with a mutex.
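
As a sketch of that pattern, here is a simplified mutex-protected idx pool where an idx goes back to the pool on a completed round trip or on timeout (a plain stack for illustration, not the actual `idx.c` implementation):

```c
/* Simplified mutex-protected idx pool; not the real idx.c code. */
#include <pthread.h>
#include <stdio.h>

#define IDX_MAX 256

static unsigned        pool[IDX_MAX];
static int             top  = -1;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void idx_init(void) {
    for (unsigned i = 0; i < IDX_MAX; i++)
        pool[++top] = i;
}

static int idx_get(unsigned *idx) {
    pthread_mutex_lock(&lock);
    int rc = (top >= 0) ? (*idx = pool[top--], 0) : -1;
    pthread_mutex_unlock(&lock);
    return rc;
}

/* called on a successful round trip OR on timeout -- after a timeout the
 * idx is reused even though the stale frame may still be in flight */
static void idx_return(unsigned idx) {
    pthread_mutex_lock(&lock);
    pool[++top] = idx;
    pthread_mutex_unlock(&lock);
}

int main(void) {
    unsigned idx;
    idx_init();
    if (idx_get(&idx) == 0) {
        printf("got idx %u\n", idx);
        idx_return(idx);
    }
    return 0;
}
```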

So the problem is the timeouts?

> Just to mention, I have no experience programming in Rust, but is the wrapper code accessible anywhere?

Not yet. I have to do some cleanup first.

robert-burger commented 4 months ago

Yes, I think it's timing (and scheduling) here: if the sender gets a timeout (for whatever reason), the idx is returned to the queue and reused later.

Btw: none of the threads should produce that much CPU load. I'm currently wondering why the socket timeout for the SOCK_RAW interface is configured to 1 µs (https://github.com/robert-burger/libethercat/blob/d9c124be9059a71cc81b2941cf2ec3a9b1736b1a/src/hw_sock_raw.c#L136). Can you try increasing that to e.g. 1000 µs? Then at least ecat.rx should have much less CPU load, and since an MTU-sized frame has to return in less than that, this should still be fine here.
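
The suggested change amounts to something like the following generic `SO_RCVTIMEO` sketch (illustrative only; it uses a UDP socket so it runs without raw-socket privileges, see `hw_sock_raw.c` for the actual setup):

```c
/* Generic sketch: set the socket receive timeout to 1000 us via
 * SO_RCVTIMEO. UDP socket used here only so the snippet runs unprivileged. */
#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct timeval tv = { .tv_sec = 0, .tv_usec = 1000 };  /* 1000 us */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
        perror("setsockopt");

    close(fd);
    return 0;
}
```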

marcfir commented 4 months ago

I found the problem. I missed the line at https://github.com/robert-burger/libethercat/blob/d9c124be9059a71cc81b2941cf2ec3a9b1736b1a/tools/example_with_dc/example_with_dc.c#L91 in the cyclic task, so the thread flooded the network.
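
For anyone hitting the same thing, a minimal sketch of why the missing wait floods the bus: a cyclic task has to block until the next cycle boundary before sending again. This uses `clock_nanosleep` with `TIMER_ABSTIME` and an assumed 1 ms cycle; the actual mechanism in `example_with_dc.c` may differ.

```c
/* Minimal paced cyclic task. Without the wait, the loop spins and
 * floods the bus with frames. Assumed 1 ms cycle for illustration. */
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>

#define CYCLE_NS 1000000L   /* 1 ms cycle, an assumption for illustration */

static void send_process_data(void) { puts("TX cycle"); }

int main(void) {
    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);

    for (int i = 0; i < 5; i++) {
        send_process_data();

        /* block until the next absolute cycle boundary */
        next.tv_nsec += CYCLE_NS;
        while (next.tv_nsec >= 1000000000L) {
            next.tv_nsec -= 1000000000L;
            next.tv_sec  += 1;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
    return 0;
}
```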