malfp / tormalwarefp

Traffic analysis for Tor-based malware detection and classification
MIT License

How to convert a pcap file into a cell file #10

Closed (pikeyang closed this issue 9 months ago)

pikeyang commented 1 year ago

How to convert a pcap file into a cell file? And is this process reversible?

malfp commented 1 year ago

A cell file contains the following components:

  1. Cells from the top-k (k=3 in the paper) most active Tor connections in a PCAP
  2. Host-level features extracted from the Zeek logs of a PCAP

Cell extraction is done as follows:

  1. Generate Zeek logs for all PCAPs.

  2. Identify the Tor connections in a PCAP: using the Tor consensus document and the Zeek conn.log, we match the destination (IP, port) of every TCP connection seen against the (IP, port) of the Tor entry guard routers listed in the consensus. If we find a match, we then look up the matched connection's uid in ssl.log and check whether the server_name field has high entropy (>=3), has length >=4, and ends with '.net' or '.com'. Together, these checks tell us whether the TCP connection is a Tor connection. (Note: Tor entry routers have server names made of random characters, e.g. www.xfjkdit7helsougt.net.)

  3. Extract cells for the top-k Tor connections from the PCAP: for each Tor connection, we read the PCAP with Python dpkt and parse the packets belonging to that connection. We parse the TLS stream to get the number of Tor cells in each packet, i.e. the length of the TLS application record divided by 514 (the size of a single Tor cell). We also note the direction of the packet (INCOMING => - / OUTGOING => +) and its timestamp relative to the first packet of the Tor connection.

  4. Generate the cell file: using the cells, directions and timestamps collected in step 3 for the top-3 Tor connections, we write each cell on its own line, with its time and direction, in the following format:

     Rank of Tor connection#time\tdirection

     For example, the first incoming packet, containing 2 Tor cells and belonging to the Tor connection with the most activity (highest number of cells), is written as:

     1#0.00\t-1\n
     1#0.00\t-1\n

     Similarly, the same activity on the 2nd most active Tor connection is written as:

     2#0.0\t-1\n
     2#0.0\t-1\n
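As a rough sketch of steps 3 and 4 (this is not the repository's cellparser.py), the snippet below counts the cells in each TLS application record, ranks connections by total cell count, and emits cell-file lines. It assumes floor division by 514, two-decimal timestamps, and that outgoing cells are written as 1; the names cells_in_record, top_k_connections and cell_lines are hypothetical.

```python
TOR_CELL_SIZE = 514  # bytes per Tor cell, as stated in step 3


def cells_in_record(record_len):
    """Whole Tor cells in one TLS application record (floor division is an assumption)."""
    return record_len // TOR_CELL_SIZE


def top_k_connections(conns, k=3):
    """Rank connections by total cell count, most active first.

    conns maps a connection uid to a list of (timestamp, direction, record_len)
    tuples. Returns [(rank, uid), ...] with rank starting at 1.
    """
    totals = {
        uid: sum(cells_in_record(length) for _, _, length in records)
        for uid, records in conns.items()
    }
    ranked = sorted(totals, key=totals.get, reverse=True)[:k]
    return list(enumerate(ranked, start=1))


def cell_lines(rank, records):
    """Emit one 'rank#time<TAB>direction' line per cell.

    timestamp is relative to the first packet of the connection;
    direction is -1 (incoming) or 1 (outgoing, sign format assumed).
    """
    lines = []
    for timestamp, direction, record_len in records:
        for _ in range(cells_in_record(record_len)):
            lines.append(f"{rank}#{timestamp:.2f}\t{direction}\n")
    return lines
```

For instance, cell_lines(1, [(0.0, -1, 1028)]) yields the two lines from the first example in step 4, since a 1028-byte record holds two 514-byte cells.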

The last line will contain the extracted host features: ##HOST_FTS,x1,x2,x3,x4.......x40
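Going back to step 2, the Tor-connection filter might look roughly like the sketch below. It is written under several assumptions and is not the code used for the paper: entropy is taken to be base-2 Shannon entropy over the characters of the whole server name, the consensus is assumed to use the usual "r" (router) and "s" (flags) lines with the IP and ORPort as the third- and second-to-last fields of each "r" line, and all function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Base-2 Shannon entropy over the characters of text (assumed metric)."""
    if not text:
        return 0.0
    total = len(text)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(text).values()
    )


def looks_like_tor_sni(server_name, min_entropy=3.0, min_len=4):
    """Apply the three server-name checks described in step 2."""
    if not server_name:
        return False
    name = server_name.lower()
    return (
        name.endswith((".net", ".com"))
        and len(name) >= min_len
        and shannon_entropy(name) >= min_entropy
    )


def guard_endpoints(consensus_text):
    """Collect (ip, or_port) for relays whose 's' line carries the Guard flag."""
    guards = set()
    endpoint = None
    for line in consensus_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "r" and len(parts) >= 8:
            # IP and ORPort precede the final DirPort field (layout assumed).
            endpoint = (parts[-3], int(parts[-2]))
        elif parts[0] == "s" and endpoint is not None:
            if "Guard" in parts[1:]:
                guards.add(endpoint)
            endpoint = None
    return guards


def is_tor_connection(dst, server_name, guards):
    """dst is the (ip, port) from conn.log; server_name comes from ssl.log."""
    return dst in guards and looks_like_tor_sni(server_name)
```

With these definitions, the example name from step 2, www.xfjkdit7helsougt.net, passes looks_like_tor_sni, while a short repetitive name such as aaaa.com falls below the entropy threshold.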

malfp commented 1 year ago

We have updated the repo following this issue:

cellparser.py : Contains the cell extraction logic (clean_parse())

extract_host_fts.py : Host level feature extractor with features introduced in the paper.

kiki-sys commented 12 months ago

Hello. In the cellparser.py file there is b = raw(pkt.payload.payload.payload); how is the raw() function defined? And for packets = rdpcap(fdir), how is the rdpcap() function defined?

pdodia commented 12 months ago

See the import on line 9: it imports all functions from the Python Scapy library. Both raw() and rdpcap() are part of that library.

kiki-sys commented 12 months ago

The host_features(dtype, fpath, torconns) function in extract_host_fts.py is a module or function library and does not contain the execution part of the program. What is the Python file that calls this host_features function?

malfp commented 12 months ago

host_features() is defined on lines 303-311 of extract_host_fts.py and is not a Python library function. It can be used to extract the host-level features introduced in our work when creating the cell files.

Note that each cell file has the host features written at its end (dataset D5 contains these cell files). This script can be used together with the cell extraction script cellparser.py to re-create the cell files when a raw PCAP is given as input. The scripts used for dataset preparation from raw PCAPs are not included in our public repo.

These helper scripts are shared to help users reconstruct the existing dataset or build an extended dataset for their own purposes.

121Hq commented 12 months ago

Hello! How can I obtain the Tor consensus document? Is it acquired by running on hybrid-analysis.com? Looking forward to your response! Thank you very much!

121Hq commented 9 months ago

Hello! “Using Tor consensus document and zeek conn.log we match the Destination (IP, Port) of all TCP connections seen to the IP, port of Tor entry guard routers as noted in the consensus. ” How can I obtain the Tor consensus document?